WeRateDogs Data Wrangling & Analysis

Author: Xavier López
Date: December 2020
Objective: This notebook contains all the code that aims to wrangle data from different data sources and provide

1. About this project

1.1 Context

The dataset that I will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs.

WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because *they're good dogs Brent*. WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs downloaded their Twitter archive and shared it with Udacity students to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017.

1.2 Project Motivation

The goal of this project is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

1.3 About the Data

We will be using data from three different sources:

  1. Twitter archive file
  2. Additional data on Twitter API
  3. Twitter image predictions file

1. Twitter archive file: It contains the core of our data and is available in data/twitter-archive-enhanced.csv.

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain, though, is each tweet's text, which has been used to extract the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, only those with ratings have been kept (there are 2356).

The data extraction has been done programmatically, but not very well. The ratings probably aren't all correct. The same goes for the dog names and probably the dog stages too (see below for more information on these). I'll need to assess and clean these columns before using them for analysis and visualization.

2. Additional data on Twitter API: We will query Twitter's API to obtain the retweet count and favorite count for each tweet, which are not available in the Twitter archive file. This additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. But because we have the WeRateDogs Twitter archive, and specifically the tweet IDs within it, we can gather this data for all 5000+.

The setting up of the API is done in this notebook.

3. Twitter image predictions file: This data is available at data/image-predictions.tsv.

Every image in the WeRateDogs Twitter archive has been run through a neural network that can classify breeds of dogs. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

2 Data Wrangling

2.1 Data Gathering

The following code reads data from the Twitter archive, queries the Twitter API for every tweet ID, writes the full API response for each tweet to a file (tweet_json.txt), and reads that file into a dataframe. Finally, the image predictions are also read.

The result is that data from different sources map to the following dataframes:

  1. Twitter archive file $\rightarrow$ df_ta
  2. Additional data on Twitter api $\rightarrow$ df_api
  3. Twitter image prediction file $\rightarrow$ df_pred
In [1]:
import pandas as pd
import numpy as np

import tweepy
import json

import sys

from datetime import date
import calendar


%matplotlib inline
import matplotlib.pyplot as plt

import altair as alt
from altair_saver import save

pd.options.display.max_colwidth = 2500

Code

In [2]:
#IMPORT CODE TO USE A PROGRESS BAR (used in generate_tweetdata_api)

from __future__ import print_function
import re


class ProgressBar(object):
    DEFAULT = 'Progress: %(bar)s %(percent)3d%%'
    FULL = '%(bar)s %(current)d/%(total)d (%(percent)3d%%) %(remaining)d to go'

    def __init__(self, total, width=40, fmt=DEFAULT, symbol='=',
                 output=sys.stderr):
        assert len(symbol) == 1

        self.total = total
        self.width = width
        self.symbol = symbol
        self.output = output
        self.fmt = re.sub(r'(?P<name>%\(.+?\))d',
            r'\g<name>%dd' % len(str(total)), fmt)

        self.current = 0

    def __call__(self):
        percent = self.current / float(self.total)
        size = int(self.width * percent)
        remaining = self.total - self.current
        bar = '[' + self.symbol * size + ' ' * (self.width - size) + ']'

        args = {
            'total': self.total,
            'bar': bar,
            'current': self.current,
            'percent': percent * 100,
            'remaining': remaining
        }
        print('\r' + self.fmt % args, file=self.output, end='')

    def done(self):
        self.current = self.total
        self()
        print('', file=self.output)
In [3]:
def get_tokens(api_keys_jsonfile):
    with open(api_keys_jsonfile) as json_file:
        data = json.load(json_file)
    return data["consumer_key"], data["consumer_secret"], data["access_token"], data["access_token_secret"]


def initialize_api(api_keys_jsonfile):
    
    consumer_key, consumer_secret, access_token, access_token_secret = get_tokens(api_keys_jsonfile)
    
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    
    api = tweepy.API(auth,wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    return api


def generate_tweetdata_api(api,tweet_ids):
    error_list = []  #ids of the tweets that could not be retrieved
    
    total = len(tweet_ids)
    progress = ProgressBar(total, fmt=ProgressBar.FULL)
    
    with open('data/tweet_json.txt', 'w+') as outfile:
        for tweet_id in tweet_ids:
            try:
                progress.current += 1
                progress()

                tweet = api.get_status(tweet_id)

                json.dump(tweet._json, outfile) #json.dump writes the tweet's raw JSON directly to the file
                                                #(json.dumps would return it as a string instead)
                outfile.write('\n')             #one JSON object per line

            except Exception:
                error_list.append(tweet_id)
    print("'data/tweet_json.txt' has been generated")

    
def get_tweet_api_exported_file():

    DF_tweets = pd.DataFrame()

    with open('data/tweet_json.txt', encoding='utf8', mode='r') as json_file:

        # iterate through each line
        for line in json_file:
            try:
                # read each json line into a dictionary
                data = json.loads(json_file.readline())

                df_tweet = pd.DataFrame(data)
                df_tweet = df_tweet[df_tweet.index == "name"].set_index("id").copy()

                DF_tweets = DF_tweets.append(df_tweet)
            except:
                print(str(json_file) + " could not be successfully processed")

    return DF_tweets[["created_at","id_str","text","truncated","source","user","retweet_count","favorite_count","lang"]]
In [4]:
print("\nRead twitter archive file")
df_ta = pd.read_csv('data/twitter-archive-enhanced.csv')
Read twitter archive file
In [5]:
print("\nGetting data from twitter api")
api = initialize_api('config/api_keys_tokens.txt')
generate_tweetdata_api(api, df_ta.tweet_id)
Getting data from twitter api
[===============                         ]  901/2356 ( 38%) 1455 to goRate limit reached. Sleeping for: 262
[==============================          ] 1801/2356 ( 76%)  555 to goRate limit reached. Sleeping for: 318
[========================================] 2356/2356 (100%)    0 to go
'data/tweet_json.txt' has been generated
In [6]:
print("\nReading data from the exported file")
df_api = get_tweet_api_exported_file()
df_api
print("\nData from 'data/tweet_json.txt' has been successfully read")
Reading data from the exported file
<_io.TextIOWrapper name='data/tweet_json.txt' mode='r' encoding='utf8'> could not be successfully processed

Data from 'data/tweet_json.txt' has been successfully read
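The "could not be successfully processed" message above appears to come from the reading loop itself: get_tweet_api_exported_file iterates over the file with `for line in json_file` and also calls `json_file.readline()` inside that loop, so each iteration consumes two lines and the last call can return an empty string. A minimal sketch of a reader that parses each line exactly once is shown below. The function name `read_tweet_json` and its `fields` parameter are hypothetical, and the sketch assumes one JSON object per line, as written by generate_tweetdata_api.

```python
import json

import pandas as pd


def read_tweet_json(path, fields=("id_str", "retweet_count", "favorite_count")):
    """Read a file with one JSON object per line into a DataFrame.

    Hypothetical helper: `fields` lists the top-level keys to keep.
    """
    records = []
    with open(path, encoding="utf8") as json_file:
        for line_number, line in enumerate(json_file, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                # Parse the line the loop has already read (no extra readline)
                tweet = json.loads(line)
            except json.JSONDecodeError:
                print(f"line {line_number} could not be parsed")
                continue
            records.append({field: tweet.get(field) for field in fields})
    return pd.DataFrame(records, columns=list(fields))
```

Collecting plain dictionaries and building the DataFrame once at the end also avoids growing a DataFrame row by row.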
In [7]:
print("\nRead predictions data")
try:
    df_pred = pd.read_csv('data/image-predictions.tsv', delimiter="\t")
    print("\nPredictions data has been successfully read")
except Exception:
    print("\nSomething went wrong")
Read predictions data

Predictions data has been successfully read
In [8]:
df_ta.head(5)
Out[8]:
tweet_id in_reply_to_status_id in_reply_to_user_id timestamp source text retweeted_status_id retweeted_status_user_id retweeted_status_timestamp expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo
0 892420643555336193 NaN NaN 2017-08-01 16:23:56 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU NaN NaN NaN https://twitter.com/dog_rates/status/892420643555336193/photo/1 13 10 Phineas None None None None
1 892177421306343426 NaN NaN 2017-08-01 00:17:27 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV NaN NaN NaN https://twitter.com/dog_rates/status/892177421306343426/photo/1 13 10 Tilly None None None None
2 891815181378084864 NaN NaN 2017-07-31 00:18:03 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB NaN NaN NaN https://twitter.com/dog_rates/status/891815181378084864/photo/1 12 10 Archie None None None None
3 891689557279858688 NaN NaN 2017-07-30 15:58:51 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ NaN NaN NaN https://twitter.com/dog_rates/status/891689557279858688/photo/1 13 10 Darla None None None None
4 891327558926688256 NaN NaN 2017-07-29 16:00:24 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f NaN NaN NaN https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 12 10 Franklin None None None None
In [9]:
df_api.head(5)
Out[9]:
created_at id_str text truncated source user retweet_count favorite_count lang
id
892177421306343426 Tue Aug 01 00:17:27 +0000 2017 892177421306343426 This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boop… https://t.co/aQFSeaCu9L True <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> WeRateDogs® 5531 30525 en
891689557279858688 Sun Jul 30 15:58:51 +0000 2017 891689557279858688 This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ False <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> WeRateDogs® 7618 38575 en
891087950875897856 Sat Jul 29 00:08:17 +0000 2017 891087950875897856 Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG:… https://t.co/xx5cilW0Dd True <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> WeRateDogs® 2752 18580 en
890729181411237888 Fri Jul 28 00:22:40 +0000 2017 890729181411237888 When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13… https://t.co/hrcFOGi12V True <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> WeRateDogs® 16647 59463 en
890240255349198849 Wed Jul 26 15:59:51 +0000 2017 890240255349198849 This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant… https://t.co/l3TSS3o2M0 True <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> WeRateDogs® 6454 29174 en
In [10]:
df_pred.head(5)
Out[10]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True

2.2 Data Assessing & Cleaning

2.2.1 Assessing Twitter archive file

Test

In [11]:
df_ta.head(5).T
Out[11]:
0 1 2 3 4
tweet_id 892420643555336193 892177421306343426 891815181378084864 891689557279858688 891327558926688256
in_reply_to_status_id NaN NaN NaN NaN NaN
in_reply_to_user_id NaN NaN NaN NaN NaN
timestamp 2017-08-01 16:23:56 +0000 2017-08-01 00:17:27 +0000 2017-07-31 00:18:03 +0000 2017-07-30 15:58:51 +0000 2017-07-29 16:00:24 +0000
source <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
text This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f
retweeted_status_id NaN NaN NaN NaN NaN
retweeted_status_user_id NaN NaN NaN NaN NaN
retweeted_status_timestamp NaN NaN NaN NaN NaN
expanded_urls https://twitter.com/dog_rates/status/892420643555336193/photo/1 https://twitter.com/dog_rates/status/892177421306343426/photo/1 https://twitter.com/dog_rates/status/891815181378084864/photo/1 https://twitter.com/dog_rates/status/891689557279858688/photo/1 https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1
rating_numerator 13 13 12 13 12
rating_denominator 10 10 10 10 10
name Phineas Tilly Archie Darla Franklin
doggo None None None None None
floofer None None None None None
pupper None None None None None
puppo None None None None None

Recall that this dataset is the core of our data. We will begin by analyzing which columns contain valuable information and dropping the useless ones.

  • The columns:

    • in_reply_to_status_id
    • in_reply_to_user_id
    • retweeted_status_id
    • retweeted_status_user_id
    • retweeted_status_timestamp

      contain mostly nulls and no useful information, so we will drop them.

Test

In [12]:
len(df_ta["retweeted_status_id"].unique()) #get number of distinct values
Out[12]:
182
In [13]:
df_ta["retweeted_status_id"].isnull().sum() #get numbers of nulls in column
Out[13]:
2175
In [14]:
df_ta["retweeted_status_id"].unique()[:10] #get the first 10 distinct values to inspect the non-null data
Out[14]:
array([           nan, 8.87473957e+17, 8.86053734e+17, 8.30583321e+17,
       8.78057613e+17, 8.78281511e+17, 6.69000397e+17, 8.76850772e+17,
       8.66334965e+17, 8.68880398e+17])

Those columns are not actually completely useless: when they are not null, the tweet is a retweet.

  • The goal of the project is to drop retweets and analyze only original tweets (to avoid duplication), so we will select only the rows that have null values in the retweet-related columns:
    • retweeted_status_id
    • retweeted_status_user_id
    • retweeted_status_timestamp
  • A different issue is that the information for one feature (dog stage) is stored across multiple columns (doggo, floofer, pupper and puppo). This information should be encoded in a single column.

    See below for information on dog stages.

The Dogtionary explains the various stages of dog: doggo, pupper, puppo, and floof(er) (via the #WeRateDogs book on Amazon)
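The retweet filter described above can be sketched as follows. This is a minimal illustration on a made-up three-row stand-in for df_ta (the name `sample` and its values are hypothetical), not necessarily the cleaning code used later in this notebook.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for df_ta; the values are invented for illustration only
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 8.87e17, np.nan],
    "retweeted_status_user_id": [np.nan, 4.2e9, np.nan],
    "retweeted_status_timestamp": [None, "2017-07-19 00:47:34 +0000", None],
})

retweet_cols = [
    "retweeted_status_id",
    "retweeted_status_user_id",
    "retweeted_status_timestamp",
]

# A row is an original tweet when every retweet-related column is null
is_original = sample[retweet_cols].isnull().all(axis=1)

# Keep the originals; the retweet columns are then entirely null and can go
originals = sample[is_original].drop(columns=retweet_cols)
```

Dropping the retweet columns only after filtering keeps the information needed to identify retweets available until it has been used.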

In [15]:
df_ta[["doggo", "floofer", "pupper", "puppo"]]
Out[15]:
doggo floofer pupper puppo
0 None None None None
1 None None None None
2 None None None None
3 None None None None
4 None None None None
... ... ... ... ...
2351 None None None None
2352 None None None None
2353 None None None None
2354 None None None None
2355 None None None None

2356 rows × 4 columns
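One possible way to collapse the four stage columns into a single column is sketched below. It assumes, as the output above suggests, that a missing stage is stored as the string "None". The column name `dog_stage`, the sample values, and the choice to join multiple stages with a comma are assumptions made for illustration.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Made-up rows: no stage, one stage, one stage, two stages
sample = pd.DataFrame({
    "doggo":   ["None", "doggo", "None", "doggo"],
    "floofer": ["None", "None", "None", "None"],
    "pupper":  ["None", "None", "pupper", "pupper"],
    "puppo":   ["None", "None", "None", "None"],
})


def combine_stages(row):
    """Join the stages present in a row; return 'None' when there are none."""
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(stages) if stages else "None"


sample["dog_stage"] = sample[stage_cols].apply(combine_stages, axis=1)
sample = sample.drop(columns=stage_cols)
```

Joining the stages keeps rows that mention more than one stage distinguishable instead of silently picking one of them.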

  • Another issue is that the rating of a tweet is encoded in two columns. For simplicity, we will generate a single tweet rating column that is easy to compare.

Test

In [16]:
df_ta[["rating_numerator","rating_denominator"]]
Out[16]:
rating_numerator rating_denominator
0 13 10
1 13 10
2 12 10
3 13 10
4 12 10
... ... ...
2351 5 10
2352 6 10
2353 9 10
2354 7 10
2355 8 10

2356 rows × 2 columns
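A single rating column could, for example, be the ratio of numerator to denominator. The sketch below assumes that definition. The column name `rating` and the sample values are hypothetical, and masking zero denominators is a defensive assumption rather than something observed above.

```python
import pandas as pd

# Made-up ratings for illustration
sample = pd.DataFrame({
    "rating_numerator": [13, 12, 5, 84],
    "rating_denominator": [10, 10, 10, 70],
})

# Replace a zero denominator with NaN so the division cannot produce inf
denominator = sample["rating_denominator"].where(sample["rating_denominator"] != 0)

# 13/10 becomes 1.3, which makes ratings with different denominators comparable
sample["rating"] = sample["rating_numerator"] / denominator
```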

  • Next, we check the values of the source column.

Test

In [17]:
set(df_ta.source)
Out[17]:
{'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
 '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
 '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>'}

There are 4 different values for source, but they are too long to read comfortably. We will map them to 4 easier-to-read categories:

- web client
- iphone
- vine
- tweet deck
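The mapping can be sketched with a dictionary keyed by the four raw values shown in the output above. The names `source_labels` and `sample` are hypothetical, and converting the result to a categorical dtype is an optional design choice.

```python
import pandas as pd

# Raw source values (as listed above) mapped to short labels
source_labels = {
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>': "web client",
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>': "iphone",
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>': "vine",
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>': "tweet deck",
}

# Stand-in for df_ta with one row per raw value
sample = pd.DataFrame({"source": list(source_labels)})

# Values missing from the dictionary would become NaN, which makes them easy to spot
sample["source"] = sample["source"].map(source_labels).astype("category")
```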
  • The name field has incorrect values. In particular, any value that does not start with a capital letter seems to be mislabeled (for example officially, very or such). These mislabeled values should be relabeled as Unknown.

Test

In [18]:
set(df_ta.name)
Out[18]:
{'Abby',
 'Ace',
 'Acro',
 'Adele',
 'Aiden',
 'Aja',
 'Akumi',
 'Al',
 'Albert',
 'Albus',
 'Aldrick',
 'Alejandro',
 'Alexander',
 'Alexanderson',
 'Alf',
 'Alfie',
 'Alfy',
 'Alice',
 'Amber',
 'Ambrose',
 'Amy',
 'Amélie',
 'Anakin',
 'Andru',
 'Andy',
 'Angel',
 'Anna',
 'Anthony',
 'Antony',
 'Apollo',
 'Aqua',
 'Archie',
 'Arlen',
 'Arlo',
 'Arnie',
 'Arnold',
 'Arya',
 'Ash',
 'Asher',
 'Ashleigh',
 'Aspen',
 'Astrid',
 'Atlas',
 'Atticus',
 'Aubie',
 'Augie',
 'Autumn',
 'Ava',
 'Axel',
 'Bailey',
 'Baloo',
 'Balto',
 'Banditt',
 'Banjo',
 'Barclay',
 'Barney',
 'Baron',
 'Barry',
 'Batdog',
 'Bauer',
 'Baxter',
 'Bayley',
 'BeBe',
 'Bear',
 'Beau',
 'Beckham',
 'Beebop',
 'Beemo',
 'Bell',
 'Bella',
 'Belle',
 'Ben',
 'Benedict',
 'Benji',
 'Benny',
 'Bentley',
 'Berb',
 'Berkeley',
 'Bernie',
 'Bert',
 'Bertson',
 'Betty',
 'Beya',
 'Biden',
 'Bilbo',
 'Billl',
 'Billy',
 'Binky',
 'Birf',
 'Bisquick',
 'Blakely',
 'Blanket',
 'Blipson',
 'Blitz',
 'Bloo',
 'Bloop',
 'Blu',
 'Blue',
 'Bluebert',
 'Bo',
 'Bob',
 'Bobb',
 'Bobbay',
 'Bobble',
 'Bobby',
 'Bode',
 'Bodie',
 'Bonaparte',
 'Bones',
 'Bookstore',
 'Boomer',
 'Boots',
 'Boston',
 'Bowie',
 'Brad',
 'Bradlay',
 'Bradley',
 'Brady',
 'Brandi',
 'Brandonald',
 'Brandy',
 'Brat',
 'Brian',
 'Brockly',
 'Brody',
 'Bronte',
 'Brooks',
 'Brownie',
 'Bruce',
 'Brudge',
 'Bruiser',
 'Bruno',
 'Brutus',
 'Bubba',
 'Bubbles',
 'Buckley',
 'Buddah',
 'Buddy',
 'Bungalo',
 'Burt',
 'Butter',
 'Butters',
 'Cal',
 'Calbert',
 'Cali',
 'Callie',
 'Calvin',
 'Canela',
 'Cannon',
 'Carbon',
 'Carl',
 'Carll',
 'Carly',
 'Carper',
 'Carter',
 'Caryl',
 'Cash',
 'Cassie',
 'CeCe',
 'Cecil',
 'Cedrick',
 'Cermet',
 'Chadrick',
 'Champ',
 'Charl',
 'Charles',
 'Charleson',
 'Charlie',
 'Chase',
 'Chaz',
 'Cheesy',
 'Chef',
 'Chelsea',
 'Cheryl',
 'Chesney',
 'Chester',
 'Chesterson',
 'Chet',
 'Chevy',
 'Chip',
 'Chipson',
 'Chloe',
 'Chompsky',
 'Christoper',
 'Chubbs',
 'Chuck',
 'Chuckles',
 'Chuq',
 'Churlie',
 'Cilantro',
 'Clarence',
 'Clark',
 'Clarkus',
 'Clarq',
 'Claude',
 'Cleopatricia',
 'Clifford',
 'Clybe',
 'Clyde',
 'Coco',
 'Cody',
 'Colby',
 'Coleman',
 'Colin',
 'Combo',
 'Comet',
 'Cooper',
 'Coops',
 'Coopson',
 'Cora',
 'Corey',
 'Covach',
 'Craig',
 'Crawford',
 'Creg',
 'Crimson',
 'Crouton',
 'Crumpet',
 'Crystal',
 'Cuddles',
 'Cupcake',
 'Cupid',
 'Curtis',
 'Daisy',
 'Dakota',
 'Dale',
 'Dallas',
 'Damon',
 'Daniel',
 'Danny',
 'Dante',
 'Darby',
 'Darla',
 'Darrel',
 'Dash',
 'Dave',
 'Davey',
 'Dawn',
 'DayZ',
 'Deacon',
 'Derby',
 'Derek',
 'Devón',
 'Dewey',
 'Dex',
 'Dexter',
 'Dido',
 'Dietrich',
 'Diogi',
 'Divine',
 'Dixie',
 'Django',
 'Dobby',
 'Doc',
 'DonDon',
 'Donny',
 'Doobert',
 'Dook',
 'Dot',
 'Dotsy',
 'Doug',
 'Duchess',
 'Duddles',
 'Dudley',
 'Dug',
 'Duke',
 'Dunkin',
 'Durg',
 'Dutch',
 'Dwight',
 'Dylan',
 'Earl',
 'Eazy',
 'Ebby',
 'Ed',
 'Edd',
 'Edgar',
 'Edmund',
 'Eevee',
 'Einstein',
 'Eleanor',
 'Eli',
 'Ellie',
 'Elliot',
 'Emanuel',
 'Ember',
 'Emma',
 'Emmie',
 'Emmy',
 'Enchilada',
 'Erik',
 'Eriq',
 'Ester',
 'Eugene',
 'Eve',
 'Evy',
 'Fabio',
 'Farfle',
 'Ferg',
 'Fido',
 'Fiji',
 'Fillup',
 'Filup',
 'Finley',
 'Finn',
 'Finnegus',
 'Fiona',
 'Fizz',
 'Flash',
 'Fletcher',
 'Florence',
 'Flurpson',
 'Flávio',
 'Frank',
 'Frankie',
 'Franklin',
 'Franq',
 'Fred',
 'Freddery',
 'Frönq',
 'Furzey',
 'Fwed',
 'Fynn',
 'Gabby',
 'Gabe',
 'Gary',
 'General',
 'Genevieve',
 'Geno',
 'Geoff',
 'George',
 'Georgie',
 'Gerald',
 'Gerbald',
 'Gert',
 'Gidget',
 'Gilbert',
 'Gin',
 'Ginger',
 'Gizmo',
 'Glacier',
 'Glenn',
 'Godi',
 'Godzilla',
 'Goliath',
 'Goose',
 'Gordon',
 'Grady',
 'Grey',
 'Griffin',
 'Griswold',
 'Grizz',
 'Grizzie',
 'Grizzwald',
 'Gromit',
 'Gunner',
 'Gus',
 'Gustaf',
 'Gustav',
 'Gòrdón',
 'Hall',
 'Halo',
 'Hammond',
 'Hamrick',
 'Hank',
 'Hanz',
 'Happy',
 'Harlso',
 'Harnold',
 'Harold',
 'Harper',
 'Harrison',
 'Harry',
 'Harvey',
 'Hazel',
 'Hector',
 'Heinrich',
 'Henry',
 'Herald',
 'Herb',
 'Hercules',
 'Herm',
 'Hermione',
 'Hero',
 'Herschel',
 'Hobbes',
 'Holly',
 'Horace',
 'Howie',
 'Hubertson',
 'Huck',
 'Humphrey',
 'Hunter',
 'Hurley',
 'Huxley',
 'Iggy',
 'Ike',
 'Indie',
 'Iroh',
 'Ito',
 'Ivar',
 'Izzy',
 'JD',
 'Jack',
 'Jackie',
 'Jackson',
 'Jameson',
 'Jamesy',
 'Jangle',
 'Jareld',
 'Jarod',
 'Jarvis',
 'Jaspers',
 'Jax',
 'Jay',
 'Jaycob',
 'Jazz',
 'Jazzy',
 'Jeb',
 'Jebberson',
 'Jed',
 'Jeffrey',
 'Jeffri',
 'Jeffrie',
 'Jennifur',
 'Jeph',
 'Jeremy',
 'Jerome',
 'Jerry',
 'Jersey',
 'Jesse',
 'Jessifer',
 'Jessiga',
 'Jett',
 'Jim',
 'Jimbo',
 'Jiminus',
 'Jiminy',
 'Jimison',
 'Jimothy',
 'Jo',
 'Jockson',
 'Joey',
 'Jomathan',
 'Jonah',
 'Jordy',
 'Josep',
 'Joshwa',
 'Juckson',
 'Julio',
 'Julius',
 'Juno',
 'Kaia',
 'Kaiya',
 'Kallie',
 'Kane',
 'Kanu',
 'Kara',
 'Karl',
 'Karll',
 'Karma',
 'Kathmandu',
 'Katie',
 'Kawhi',
 'Kayla',
 'Keet',
 'Keith',
 'Kellogg',
 'Ken',
 'Kendall',
 'Kenneth',
 'Kenny',
 'Kenzie',
 'Keurig',
 'Kevin',
 'Kevon',
 'Kial',
 'Kilo',
 'Kingsley',
 'Kirby',
 'Kirk',
 'Klein',
 'Klevin',
 'Kloey',
 'Kobe',
 'Koda',
 'Kody',
 'Koko',
 'Kollin',
 'Kona',
 'Kota',
 'Kramer',
 'Kreg',
 'Kreggory',
 'Kulet',
 'Kuyu',
 'Kyle',
 'Kyro',
 'Lacy',
 'Laela',
 'Laika',
 'Lambeau',
 'Lance',
 'Larry',
 'Lassie',
 'Layla',
 'Leela',
 'Lennon',
 'Lenny',
 'Lenox',
 'Leo',
 'Leonard',
 'Leonidas',
 'Levi',
 'Liam',
 'Lilah',
 'Lili',
 'Lilli',
 'Lillie',
 'Lilly',
 'Lily',
 'Lincoln',
 'Linda',
 'Link',
 'Linus',
 'Lipton',
 'Livvie',
 'Lizzie',
 'Logan',
 'Loki',
 'Lola',
 'Lolo',
 'Longfellow',
 'Loomis',
 'Lorelei',
 'Lorenzo',
 'Lou',
 'Louie',
 'Louis',
 'Luca',
 'Lucia',
 'Lucky',
 'Lucy',
 'Lugan',
 'Lulu',
 'Luna',
 'Lupe',
 'Luther',
 'Mabel',
 'Mac',
 'Mack',
 'Maddie',
 'Maggie',
 'Mairi',
 'Maisey',
 'Major',
 'Maks',
 'Malcolm',
 'Malikai',
 'Margo',
 'Mark',
 'Marlee',
 'Marley',
 'Marq',
 'Marty',
 'Marvin',
 'Mary',
 'Mason',
 'Mattie',
 'Maude',
 'Mauve',
 'Max',
 'Maxaroni',
 'Maximus',
 'Maxwell',
 'Maya',
 'Meatball',
 'Meera',
 'Meyer',
 'Mia',
 'Michelangelope',
 'Miguel',
 'Mike',
 'Miley',
 'Milky',
 'Millie',
 'Milo',
 'Mimosa',
 'Mingus',
 'Mister',
 'Misty',
 'Mitch',
 'Mo',
 'Moe',
 'Mojo',
 'Mollie',
 'Molly',
 'Mona',
 'Monkey',
 'Monster',
 'Monty',
 'Moofasa',
 'Mookie',
 'Moose',
 'Moreton',
 'Mosby',
 'Murphy',
 'Mutt',
 'Mya',
 'Nala',
 'Naphaniel',
 'Napolean',
 'Nelly',
 'Neptune',
 'Newt',
 'Nico',
 'Nida',
 'Nigel',
 'Nimbus',
 'Noah',
 'Nollie',
 'None',
 'Noosh',
 'Norman',
 'Nugget',
 'O',
 'Oakley',
 'Obi',
 'Obie',
 'Oddie',
 'Odie',
 'Odin',
 'Olaf',
 'Ole',
 'Olive',
 'Oliver',
 'Olivia',
 'Oliviér',
 'Ollie',
 'Opal',
 'Opie',
 'Oreo',
 'Orion',
 'Oscar',
 'Oshie',
 'Otis',
 'Ozzie',
 'Ozzy',
 'Pablo',
 'Paisley',
 'Pancake',
 'Panda',
 'Patch',
 'Patrick',
 'Paull',
 'Pavlov',
 'Pawnd',
 'Peaches',
 'Peanut',
 'Penelope',
 'Penny',
 'Pepper',
 'Percy',
 'Perry',
 'Pete',
 'Petrick',
 'Pherb',
 'Phil',
 'Philbert',
 'Philippe',
 'Phineas',
 'Phred',
 'Pickles',
 'Pilot',
 'Pinot',
 'Pip',
 'Piper',
 'Pippa',
 'Pippin',
 'Pipsy',
 'Pluto',
 'Poppy',
 'Pubert',
 'Puff',
 'Pumpkin',
 'Pupcasso',
 'Quinn',
 'Ralf',
 'Ralph',
 'Ralpher',
 'Ralphie',
 'Ralphson',
 'Ralphus',
 'Ralphy',
 'Ralphé',
 'Rambo',
 'Randall',
 'Raphael',
 'Rascal',
 'Raymond',
 'Reagan',
 'Reese',
 'Reggie',
 'Reginald',
 'Remington',
 'Remus',
 'Remy',
 'Reptar',
 'Rey',
 'Rhino',
 'Richie',
 'Ricky',
 'Ridley',
 'Riley',
 'Rilo',
 'Rinna',
 'River',
 'Rizzo',
 'Rizzy',
 'Robin',
 'Rocco',
 'Rocky',
 'Rodman',
 'Rodney',
 'Rolf',
 'Romeo',
 'Ron',
 'Ronduh',
 'Ronnie',
 'Rontu',
 'Rooney',
 'Roosevelt',
 'Rorie',
 'Rory',
 'Roscoe',
 'Rose',
 'Rosie',
 'Rover',
 'Rubio',
 'Ruby',
 'Rudy',
 'Rueben',
 'Ruffles',
 'Rufio',
 'Rufus',
 'Rumble',
 'Rumpole',
 'Rupert',
 'Rusty',
 'Sadie',
 'Sage',
 'Sailer',
 'Sailor',
 'Sam',
 'Sammy',
 'Sampson',
 'Samsom',
 'Samson',
 'Sandra',
 'Sandy',
 'Sansa',
 'Sarge',
 'Saydee',
 'Schnitzel',
 'Schnozz',
 'Scooter',
 'Scott',
 'Scout',
 'Scruffers',
 'Seamus',
 'Sebastian',
 'Sephie',
 'Severus',
 'Shadoe',
 'Shadow',
 'Shaggy',
 'Shakespeare',
 'Shawwn',
 'Shelby',
 'Shikha',
 'Shiloh',
 'Shnuggles',
 'Shooter',
 'Siba',
 'Sid',
 'Sierra',
 'Simba',
 'Skittle',
 'Skittles',
 'Sky',
 'Skye',
 'Smiley',
 'Smokey',
 'Snickers',
 'Snicku',
 'Snoop',
 'Snoopy',
 'Sobe',
 'Socks',
 'Sojourner',
 'Solomon',
 'Sonny',
 'Sophie',
 'Sora',
 'Spanky',
 'Spark',
 'Sparky',
 'Spencer',
 'Sprinkles',
 'Sprout',
 'Staniel',
 'Stanley',
 'Stark',
 'Stefan',
 'Stella',
 'Stephan',
 'Stephanus',
 'Steve',
 'Steven',
 'Stewie',
 'Storkson',
 'Stormy',
 'Strider',
 'Striker',
 'Strudel',
 'Stu',
 'Stuart',
 'Stubert',
 'Sugar',
 'Suki',
 'Sully',
 'Sundance',
 'Sunny',
 'Sunshine',
 'Superpup',
 'Swagger',
 'Sweet',
 'Sweets',
 'Taco',
 'Tango',
 'Tanner',
 'Tassy',
 'Tater',
 'Tayzie',
 'Taz',
 'Tebow',
 'Ted',
 'Tedders',
 'Teddy',
 'Tedrick',
 'Terrance',
 'Terrenth',
 'Terry',
 'Tess',
 'Tessa',
 'Theo',
 'Theodore',
 'Thor',
 'Thumas',
 'Tiger',
 'Tilly',
 'Timber',
 'Timison',
 'Timmy',
 'Timofy',
 'Tino',
 'Titan',
 'Tito',
 'Tobi',
 'Toby',
 'Todo',
 'Toffee',
 'Tom',
 'Tommy',
 'Tonks',
 'Torque',
 'Tove',
 'Travis',
 'Traviss',
 'Trevith',
 'Trigger',
 'Trip',
 'Tripp',
 'Trooper',
 'Tuck',
 'Tucker',
 'Tuco',
 'Tug',
 'Tupawc',
 'Tycho',
 'Tyr',
 'Tyrone',
 'Tyrus',
 'Ulysses',
 'Venti',
 'Vince',
 'Vincent',
 'Vinnie',
 'Vinscent',
 'Vixen',
 'Wafer',
 'Waffles',
 'Walker',
 'Wallace',
 'Wally',
 'Walter',
 'Watson',
 'Wesley',
 'Wiggles',
 'Willem',
 'William',
 'Willie',
 'Willow',
 'Willy',
 'Wilson',
 'Winifred',
 'Winnie',
 'Winston',
 'Wishes',
 'Wyatt',
 'Yoda',
 'Yogi',
 'Yukon',
 'Zara',
 'Zeek',
 'Zeke',
 'Zeus',
 'Ziva',
 'Zoe',
 'Zoey',
 'Zooey',
 'Zuzu',
 'a',
 'actually',
 'all',
 'an',
 'by',
 'getting',
 'his',
 'incredibly',
 'infuriating',
 'just',
 'life',
 'light',
 'mad',
 'my',
 'not',
 'officially',
 'old',
 'one',
 'quite',
 'space',
 'such',
 'the',
 'this',
 'unacceptable',
 'very'}
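The relabeling rule for names can be sketched as below, on a few values taken from the set above. Treating the literal string "None" as Unknown as well is an additional assumption, not something stated in the text, and `sample` is a hypothetical stand-in for df_ta.

```python
import pandas as pd

sample = pd.DataFrame({"name": ["Phineas", "a", "None", "officially", "BeBe", "very"]})

# True where the first character of the name is not an uppercase letter
not_capitalized = ~sample["name"].str[0].str.isupper()

# Assumption: the placeholder string "None" also means the name is unknown
is_placeholder = sample["name"] == "None"

sample.loc[not_capitalized | is_placeholder, "name"] = "Unknown"
```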
  • The expanded_urls column gives more than one value for tweets with more than one image, and those values are repeated and separated by commas.

    We will take a simplified approach and keep a single URL for each tweet (the first element of the list, i.e. the first photo).

Test

In [19]:
df_ta.expanded_urls[10].split(",")
Out[19]:
['https://twitter.com/dog_rates/status/890006608113172480/photo/1',
 'https://twitter.com/dog_rates/status/890006608113172480/photo/1']
In [20]:
df_ta.expanded_urls[4].split(",")
Out[20]:
['https://twitter.com/dog_rates/status/891327558926688256/photo/1',
 'https://twitter.com/dog_rates/status/891327558926688256/photo/1']
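Keeping only the first URL can be sketched with the pandas string accessor. The `sample` frame is a hypothetical stand-in for df_ta, and the missing value is included only to show that it stays missing.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "expanded_urls": [
        # Same URL repeated, separated by a comma (as in row 4 above)
        "https://twitter.com/dog_rates/status/891327558926688256/photo/1,"
        "https://twitter.com/dog_rates/status/891327558926688256/photo/1",
        "https://twitter.com/dog_rates/status/892420643555336193/photo/1",
        np.nan,
    ]
})

# Split on commas and keep the first element; NaN values are left untouched
sample["expanded_urls"] = sample["expanded_urls"].str.split(",").str[0]
```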
  • The timestamp column contains very valuable information, but it can be hard to consume. We would like to get:
    • The day of the week
    • The hour (only the hour, not min, not second, in a single column)
    • The year
    • The month/year (calmonth)
    • The day/month/year (calday)
In [21]:
df_ta.timestamp[0].split(" ")
Out[21]:
['2017-08-01', '16:23:56', '+0000']
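Rather than splitting the string by hand, the features listed above can be derived by parsing the timestamp once. This is a sketch under the assumption that every timestamp follows the format shown in the output above. The column names `weekday` and `hour` are hypothetical, while `year`, `calmonth` and `calday` follow the labels in the list.

```python
import pandas as pd

# Two timestamps copied from the archive preview
sample = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-29 16:00:24 +0000"]
})

# Parse once into timezone-aware datetimes
ts = pd.to_datetime(sample["timestamp"], format="%Y-%m-%d %H:%M:%S %z")

sample["weekday"] = ts.dt.day_name()            # day of the week
sample["hour"] = ts.dt.hour                     # hour only, no minutes or seconds
sample["year"] = ts.dt.year
sample["calmonth"] = ts.dt.strftime("%m/%Y")    # month/year
sample["calday"] = ts.dt.strftime("%d/%m/%Y")   # day/month/year
```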
In [22]:
set([x.split(" ")[1] for x in df_ta.timestamp])
Out[22]:
{'16:13:44',
 '15:26:30',
 '15:40:26',
 '18:42:20',
 '03:47:50',
 '21:49:15',
 '18:51:11',
 '01:42:53',
 '15:55:59',
 '23:58:35',
 '03:09:55',
 '02:08:22',
 '00:34:33',
 '18:41:02',
 '19:55:35',
 '17:41:18',
 '18:52:38',
 '00:59:15',
 '01:39:11',
 '23:53:08',
 '20:33:19',
 '01:11:51',
 '02:56:28',
 '03:03:06',
 '19:11:53',
 '23:06:23',
 '17:33:49',
 '02:40:19',
 '22:00:08',
 '19:24:28',
 '21:13:35',
 '03:08:17',
 '02:21:21',
 '04:03:02',
 '03:13:11',
 '17:01:34',
 '00:29:39',
 '16:37:54',
 '21:00:04',
 '01:39:49',
 '23:34:55',
 '00:20:47',
 '23:58:40',
 '22:31:36',
 '19:05:49',
 '16:06:48',
 '20:56:55',
 '01:10:04',
 '04:01:37',
 '21:14:20',
 '17:42:34',
 '16:10:20',
 '03:30:58',
 '16:09:56',
 '00:14:12',
 '19:11:49',
 '17:40:04',
 '19:21:47',
 '02:25:23',
 '18:09:09',
 '02:26:00',
 '03:21:00',
 '02:42:26',
 '17:23:04',
 '20:00:23',
 '23:01:59',
 '16:34:32',
 '00:47:59',
 '02:36:57',
 '03:02:47',
 '19:22:56',
 '01:05:02',
 '17:02:17',
 '15:37:03',
 '02:10:39',
 '03:20:44',
 '01:06:33',
 '17:31:20',
 '02:09:56',
 '19:24:02',
 '19:31:20',
 '17:02:04',
 '16:10:44',
 '21:34:37',
 '22:04:05',
 '00:04:38',
 '22:57:10',
 '15:51:24',
 '17:14:23',
 '02:46:49',
 '00:39:48',
 '19:05:32',
 '15:57:30',
 '03:44:34',
 '18:49:22',
 '03:57:26',
 '04:56:16',
 '01:44:52',
 '17:57:57',
 '16:08:03',
 '06:37:25',
 '23:34:00',
 '01:29:02',
 '01:25:36',
 '00:18:10',
 '02:30:23',
 '16:24:37',
 '22:42:52',
 '22:03:49',
 '00:49:30',
 '00:18:04',
 '21:06:00',
 '23:04:14',
 '02:53:11',
 '02:57:26',
 '01:20:49',
 '01:54:34',
 '20:14:22',
 '16:14:48',
 '19:39:34',
 '15:17:01',
 '00:30:50',
 '16:30:45',
 '19:03:06',
 '04:01:58',
 '19:24:27',
 '04:27:31',
 '15:14:19',
 '00:16:48',
 '02:08:05',
 '15:58:47',
 '21:00:12',
 '03:39:15',
 '02:21:26',
 '03:38:27',
 '03:13:46',
 '00:43:49',
 '03:06:01',
 '16:51:59',
 '00:43:25',
 '03:33:58',
 '03:24:40',
 '02:42:51',
 '00:54:06',
 '04:27:59',
 '23:36:44',
 '22:36:19',
 '03:55:21',
 '22:09:14',
 '20:38:19',
 '18:17:33',
 '04:27:09',
 '15:55:58',
 '04:35:10',
 '20:40:41',
 '00:53:56',
 '03:46:05',
 '04:40:46',
 '00:00:38',
 '00:14:32',
 '17:51:44',
 '16:00:12',
 '19:13:01',
 '18:17:08',
 '18:53:24',
 '23:37:28',
 '20:30:30',
 '18:54:34',
 '02:47:04',
 '01:24:33',
 '00:18:35',
 '22:00:52',
 '01:31:38',
 '16:08:30',
 '16:09:20',
 '02:30:43',
 '18:47:24',
 '22:20:06',
 '01:08:55',
 '00:46:20',
 '21:39:54',
 '01:03:14',
 '00:25:14',
 '20:17:59',
 '18:17:59',
 '00:37:03',
 '02:51:54',
 '22:24:31',
 '17:33:48',
 '16:04:13',
 '01:38:42',
 '01:31:12',
 '01:15:49',
 '01:18:40',
 '17:00:26',
 '23:54:05',
 '17:50:56',
 '04:23:49',
 '01:01:59',
 '13:11:05',
 '00:21:08',
 '01:48:55',
 '01:25:10',
 '23:25:35',
 '20:09:54',
 '01:21:40',
 '17:52:38',
 '02:53:17',
 '02:27:27',
 '01:22:10',
 '01:24:27',
 '00:25:18',
 '17:36:50',
 '18:07:47',
 '03:11:35',
 '00:01:46',
 '18:39:05',
 '22:41:22',
 '17:52:40',
 '01:34:21',
 '01:52:36',
 '16:22:55',
 '23:00:11',
 '19:14:50',
 '01:35:01',
 '23:10:06',
 '16:41:12',
 '00:06:39',
 '03:30:07',
 '00:17:55',
 '01:11:29',
 '00:58:13',
 '19:02:24',
 '04:14:13',
 '17:42:10',
 '18:35:39',
 '00:27:14',
 '02:41:01',
 '05:08:29',
 '23:32:35',
 '23:00:17',
 '00:04:50',
 '01:44:00',
 '17:05:31',
 '17:01:14',
 '00:22:57',
 '01:23:05',
 '02:47:56',
 '00:57:27',
 '15:08:56',
 '18:56:45',
 '02:10:14',
 '03:01:06',
 '18:43:31',
 '03:37:31',
 '16:06:04',
 '23:51:49',
 '17:38:09',
 '03:45:53',
 '17:37:00',
 '02:54:41',
 '02:41:12',
 '01:24:35',
 '00:41:48',
 '23:29:14',
 '00:59:46',
 '16:57:35',
 '19:55:30',
 '01:03:45',
 '16:00:17',
 '01:42:09',
 '17:27:23',
 '15:32:42',
 '00:18:03',
 '00:13:04',
 '02:01:49',
 '02:48:31',
 '16:08:44',
 '00:15:37',
 '01:00:07',
 '00:04:21',
 '18:13:27',
 '03:29:49',
 '18:02:38',
 '01:04:45',
 '22:15:26',
 '01:04:29',
 '00:32:26',
 '01:17:51',
 '03:42:44',
 '18:55:51',
 '01:40:38',
 '00:30:04',
 '00:59:40',
 '23:47:49',
 '02:45:22',
 '22:55:23',
 '19:04:15',
 '02:52:03',
 '14:20:41',
 '00:27:39',
 '15:59:17',
 '00:57:20',
 '02:40:05',
 '15:58:53',
 '01:10:13',
 '02:22:29',
 '18:29:43',
 '02:10:24',
 '22:01:40',
 '01:52:02',
 '16:00:13',
 '18:01:07',
 '17:04:02',
 '05:07:29',
 '19:04:19',
 '17:50:33',
 '21:41:44',
 '01:41:58',
 '17:00:25',
 '02:06:27',
 '17:51:04',
 '01:04:13',
 '03:29:07',
 '20:26:26',
 '02:46:44',
 '17:00:21',
 '01:33:08',
 '16:04:20',
 '04:35:39',
 '17:31:15',
 '16:14:55',
 '04:31:49',
 '17:10:04',
 '17:22:24',
 '04:14:59',
 '22:15:21',
 '15:53:19',
 '22:54:18',
 '01:49:03',
 '03:10:43',
 '00:04:57',
 '16:10:40',
 '00:54:05',
 '00:32:32',
 '03:36:28',
 '02:10:37',
 '01:22:35',
 '02:58:09',
 '17:25:59',
 '22:49:15',
 '17:23:57',
 '00:26:15',
 '03:11:30',
 '16:01:23',
 '22:30:44',
 '23:55:38',
 '16:53:37',
 '01:59:36',
 '01:48:22',
 '17:53:31',
 '18:00:41',
 '16:08:50',
 '13:24:20',
 '16:22:16',
 '22:32:36',
 '22:17:55',
 '01:37:30',
 '20:03:43',
 '01:28:25',
 '00:23:06',
 '21:29:33',
 '17:58:03',
 '03:55:04',
 '01:00:34',
 '20:01:55',
 '00:19:04',
 '16:54:09',
 '01:36:26',
 '00:54:28',
 '20:21:02',
 '15:27:17',
 '16:15:54',
 '00:52:45',
 '19:22:09',
 '02:56:22',
 '02:57:52',
 '03:46:11',
 '17:13:02',
 '02:29:37',
 '02:19:31',
 '03:26:43',
 '15:57:56',
 '15:54:28',
 '15:58:28',
 '16:03:16',
 '01:47:28',
 '00:13:52',
 '02:08:07',
 '20:47:17',
 '16:01:13',
 '22:44:42',
 '17:28:22',
 '21:18:05',
 '00:13:17',
 '02:21:04',
 '18:52:06',
 '02:21:29',
 '18:42:44',
 '00:16:10',
 '16:14:40',
 '17:07:18',
 '01:07:28',
 '01:40:41',
 '22:48:24',
 '19:00:02',
 '05:25:42',
 '23:24:56',
 '00:54:46',
 '19:23:13',
 '18:39:13',
 '00:04:08',
 '01:59:39',
 '02:23:09',
 '04:00:18',
 '16:10:29',
 '04:44:55',
 '22:52:02',
 '23:43:18',
 '01:42:24',
 '17:28:39',
 '01:13:34',
 '00:30:51',
 '02:53:48',
 '15:59:24',
 '21:58:53',
 '02:06:59',
 '01:54:44',
 '01:25:31',
 '02:09:53',
 '21:26:58',
 '01:49:05',
 '16:02:49',
 '00:25:26',
 '21:24:36',
 '20:58:07',
 '01:35:24',
 '15:58:11',
 '20:19:52',
 '19:32:29',
 '15:19:12',
 '04:09:13',
 '00:00:07',
 '15:40:07',
 '02:37:35',
 '03:51:38',
 '02:30:58',
 '00:00:04',
 '03:18:15',
 '20:48:40',
 '15:00:16',
 '00:37:52',
 '02:46:29',
 '18:09:23',
 '02:20:45',
 '20:08:52',
 '23:56:03',
 '03:28:25',
 '18:15:55',
 '02:31:10',
 '22:06:57',
 '01:44:13',
 '01:20:08',
 '01:00:55',
 '23:58:41',
 '20:47:30',
 '05:52:43',
 '00:40:24',
 '18:31:02',
 '17:00:17',
 '20:07:44',
 '03:54:22',
 '23:57:46',
 '18:44:32',
 '02:09:24',
 '01:19:47',
 '23:00:08',
 '03:07:12',
 '02:05:49',
 '19:30:01',
 '22:04:54',
 '17:17:44',
 '01:16:17',
 '00:22:39',
 '02:40:23',
 '04:21:26',
 '02:29:07',
 '23:35:28',
 '15:29:30',
 '21:39:24',
 '01:42:20',
 '01:52:18',
 '20:40:38',
 '16:52:08',
 '15:58:34',
 '16:28:21',
 '16:44:23',
 '00:15:59',
 '22:02:01',
 '15:44:53',
 '20:12:29',
 '22:16:42',
 '16:57:37',
 '16:12:33',
 '04:36:06',
 '01:00:13',
 '03:00:19',
 '04:00:04',
 '23:10:47',
 '01:41:06',
 '00:12:06',
 '01:42:22',
 '17:02:54',
 '02:21:30',
 '20:27:34',
 '00:03:26',
 '01:13:53',
 '02:39:42',
 '20:55:28',
 '17:16:20',
 '20:48:07',
 '23:55:18',
 '02:29:49',
 '18:19:37',
 '00:24:34',
 '01:26:42',
 '02:50:28',
 '17:26:08',
 '16:07:23',
 '22:04:39',
 '00:41:42',
 '16:52:05',
 '22:00:04',
 '01:36:14',
 '16:11:11',
 '00:37:48',
 '00:15:33',
 '17:00:08',
 '23:35:32',
 '04:17:01',
 '17:21:08',
 '00:28:40',
 '19:09:37',
 '04:37:05',
 '01:11:28',
 '22:02:38',
 '02:18:32',
 '18:51:56',
 '02:00:06',
 '16:06:11',
 '21:05:23',
 '02:53:12',
 '03:14:25',
 '19:16:47',
 '03:22:35',
 '17:51:13',
 '00:20:11',
 '02:43:18',
 '01:05:25',
 '21:54:41',
 '00:20:23',
 '03:33:17',
 '03:18:42',
 '19:10:13',
 '04:14:39',
 '18:38:36',
 '05:43:44',
 '22:54:44',
 '01:56:49',
 '16:03:00',
 '02:47:37',
 '01:24:14',
 '03:11:42',
 '02:09:34',
 '00:02:42',
 '03:32:10',
 '19:50:26',
 '00:00:35',
 '03:38:05',
 '01:33:55',
 '04:00:46',
 '18:10:30',
 '03:18:27',
 '17:37:36',
 '02:48:07',
 '17:42:57',
 '05:26:34',
 '22:14:07',
 '02:28:08',
 '17:58:09',
 '22:45:42',
 '19:22:38',
 '18:43:29',
 '00:00:02',
 '16:12:09',
 '02:41:38',
 '19:43:10',
 '21:34:09',
 '02:49:59',
 '00:01:00',
 '17:24:05',
 '03:20:20',
 '03:12:08',
 '01:25:33',
 '02:14:29',
 '02:03:45',
 '18:56:35',
 '01:50:18',
 '16:13:51',
 '16:09:13',
 '18:31:54',
 '03:08:26',
 '18:01:05',
 '03:58:55',
 '01:00:05',
 '03:00:47',
 '01:09:42',
 '01:26:04',
 '17:30:24',
 '17:46:12',
 '16:33:49',
 '17:34:13',
 '22:53:48',
 '18:33:48',
 '16:25:51',
 '16:30:13',
 '01:19:36',
 '03:40:16',
 '16:56:11',
 '03:47:25',
 '17:00:27',
 '19:35:46',
 '21:32:13',
 '17:04:07',
 '02:03:02',
 '18:26:02',
 '01:53:39',
 '17:29:20',
 '16:25:34',
 '02:36:23',
 '21:12:41',
 '23:52:16',
 '20:38:24',
 '22:45:43',
 '23:30:09',
 '01:19:32',
 '15:55:36',
 '04:59:42',
 '19:00:33',
 '21:18:40',
 '21:16:49',
 '13:04:55',
 '01:15:58',
 '00:32:10',
 '19:22:30',
 '16:18:34',
 '01:02:55',
 '23:30:47',
 '01:40:58',
 '01:29:35',
 '00:07:32',
 '02:49:55',
 '02:23:49',
 '02:06:06',
 '01:22:17',
 '23:47:07',
 '21:17:12',
 '02:24:13',
 '00:16:21',
 '01:02:50',
 '01:20:33',
 '02:00:27',
 '05:07:27',
 '00:45:35',
 '16:00:06',
 '02:48:49',
 '01:54:54',
 '01:42:26',
 '03:33:22',
 '20:47:36',
 '03:39:17',
 '15:14:57',
 '02:25:47',
 '01:59:37',
 '16:20:36',
 '00:44:30',
 '00:08:34',
 '17:12:53',
 '15:46:33',
 '04:18:42',
 '23:04:02',
 '18:36:06',
 '00:00:54',
 '03:45:22',
 '19:31:59',
 '03:19:24',
 '03:35:31',
 '01:41:23',
 '03:14:10',
 '03:14:30',
 '00:06:44',
 '15:39:48',
 '18:26:18',
 '04:38:35',
 '00:14:46',
 '01:03:12',
 '02:14:42',
 '19:39:43',
 '02:32:17',
 '23:10:52',
 '03:50:10',
 '23:02:22',
 '23:41:18',
 '23:53:52',
 '00:53:55',
 '17:12:16',
 '23:42:26',
 '04:44:10',
 '23:25:31',
 '17:04:16',
 '01:44:22',
 '00:38:54',
 '22:55:55',
 '04:45:50',
 '02:35:32',
 '16:58:45',
 '01:53:28',
 '17:16:37',
 '18:25:21',
 '19:13:05',
 '04:07:53',
 '04:03:51',
 '01:47:22',
 '01:38:00',
 '17:02:36',
 '17:01:29',
 '03:05:01',
 '21:00:18',
 '17:20:56',
 '03:28:27',
 '00:55:01',
 '16:33:36',
 '23:59:28',
 '23:43:25',
 '19:51:59',
 '03:54:25',
 '02:45:32',
 '01:12:59',
 '02:57:08',
 '18:25:07',
 '19:48:43',
 '18:03:45',
 '20:35:22',
 '17:19:36',
 '03:02:54',
 '03:57:12',
 '17:11:59',
 '03:24:33',
 '16:04:27',
 '15:13:52',
 '00:32:18',
 '00:54:18',
 '02:45:48',
 '15:51:22',
 '20:07:04',
 '01:25:37',
 '02:20:14',
 '03:52:26',
 '19:49:07',
 '15:51:39',
 '01:37:04',
 '15:57:26',
 '05:28:02',
 '17:30:22',
 '01:18:59',
 '16:59:01',
 '00:06:54',
 '02:43:09',
 '22:08:59',
 '01:05:59',
 '15:43:18',
 '23:04:13',
 '04:25:07',
 '18:10:33',
 '15:31:05',
 '02:54:12',
 '23:15:56',
 '04:05:59',
 '00:24:48',
 '23:20:02',
 '20:39:06',
 '04:46:13',
 '03:28:06',
 '12:14:36',
 '16:24:01',
 '23:13:58',
 '18:17:56',
 '18:18:36',
 '02:15:55',
 '19:38:25',
 '19:47:08',
 '23:42:19',
 '02:56:40',
 '04:14:49',
 '01:06:43',
 '00:02:45',
 '16:06:28',
 '00:57:05',
 '21:00:48',
 '16:06:54',
 '22:59:35',
 '01:42:25',
 '05:03:47',
 '17:00:24',
 '01:33:43',
 '01:18:12',
 '23:16:13',
 '00:06:08',
 '01:06:27',
 '17:18:34',
 '00:58:53',
 '04:35:11',
 '02:13:34',
 '01:18:00',
 '02:06:41',
 '01:22:14',
 '03:18:23',
 '00:05:17',
 '02:54:04',
 '15:59:50',
 '02:23:45',
 '00:05:25',
 '02:56:24',
 '20:30:39',
 '02:15:25',
 '16:20:15',
 '16:27:58',
 '16:28:54',
 '16:18:11',
 '18:09:16',
 '00:10:02',
 '16:50:42',
 '01:21:19',
 '21:52:49',
 '04:47:03',
 '19:42:02',
 '01:12:28',
 '02:31:39',
 '03:13:29',
 '16:54:26',
 '01:15:07',
 '01:56:44',
 '02:53:32',
 '01:02:48',
 '02:20:37',
 '04:22:44',
 '23:53:05',
 '16:10:12',
 '21:23:57',
 '16:27:23',
 '16:25:25',
 '21:31:28',
 '17:12:48',
 '01:07:18',
 '15:40:31',
 '00:04:36',
 '03:51:52',
 '03:27:11',
 '02:13:31',
 '03:55:41',
 '18:31:19',
 '23:59:09',
 '01:31:47',
 '20:41:33',
 '02:12:04',
 '15:49:57',
 '01:53:37',
 '04:06:20',
 '16:26:50',
 '15:58:40',
 '00:22:40',
 '17:38:19',
 '00:58:41',
 '01:46:03',
 '01:00:50',
 '23:42:03',
 '20:18:30',
 '02:22:04',
 '01:02:36',
 '00:43:57',
 '04:22:29',
 '01:14:35',
 '01:00:24',
 '01:58:22',
 '23:52:28',
 '19:45:39',
 '02:28:06',
 '23:54:11',
 '01:11:22',
 '00:47:34',
 '20:35:48',
 '18:48:16',
 '23:05:30',
 '01:02:40',
 '23:33:12',
 '02:52:48',
 '18:59:46',
 '16:13:10',
 '02:17:31',
 '22:38:43',
 '17:01:56',
 '23:18:48',
 '03:16:46',
 '23:13:03',
 '01:03:46',
 '23:23:41',
 '18:00:19',
 '19:35:59',
 '00:55:42',
 '01:12:38',
 '00:57:32',
 '23:13:01',
 '15:30:43',
 '03:48:51',
 '17:32:08',
 '00:31:29',
 '19:56:24',
 '02:07:05',
 '01:23:03',
 '15:59:51',
 '17:00:28',
 '17:36:16',
 '01:29:21',
 '17:02:56',
 '16:47:50',
 '16:30:07',
 '01:22:47',
 '17:23:53',
 '01:04:17',
 '23:52:22',
 '02:18:42',
 '02:20:27',
 '03:33:34',
 '04:34:45',
 '19:43:36',
 '23:44:41',
 '06:01:26',
 '16:08:49',
 '02:20:29',
 '03:58:25',
 '16:24:19',
 '02:17:13',
 '21:00:23',
 '15:07:30',
 '00:13:58',
 '04:48:02',
 '21:02:13',
 '16:11:18',
 '00:49:23',
 '23:50:52',
 '03:43:31',
 '16:14:39',
 '03:22:39',
 '01:27:03',
 '03:17:46',
 '20:40:47',
 '00:15:09',
 '02:32:25',
 '01:38:16',
 '15:25:23',
 '15:22:08',
 '15:26:28',
 '16:49:55',
 '21:01:17',
 '21:20:32',
 '18:24:26',
 '02:38:53',
 '22:58:05',
 '16:00:30',
 '18:49:36',
 '01:32:24',
 '22:51:24',
 '16:37:02',
 '21:06:41',
 '19:39:28',
 '03:55:45',
 '15:36:45',
 '02:33:29',
 '02:23:42',
 '00:17:27',
 ...}

2.2.2 Cleaning Twitter archive file

  • Drop retweets:

Code:

In [23]:
df_ta_dropretweets = df_ta.copy()
In [24]:
df_ta_dropretweets = df_ta_dropretweets[df_ta_dropretweets.retweeted_status_id.isnull()]
df_ta_dropretweets = df_ta_dropretweets[df_ta_dropretweets.retweeted_status_user_id.isnull()]
df_ta_dropretweets = df_ta_dropretweets[df_ta_dropretweets.retweeted_status_timestamp.isnull()]

Test

In [25]:
len(df_ta_dropretweets["retweeted_status_id"].unique()) #get number of distinct values
Out[25]:
1
In [26]:
len(df_ta_dropretweets["retweeted_status_user_id"].unique()) #get number of distinct values
Out[26]:
1
In [27]:
len(df_ta_dropretweets["retweeted_status_timestamp"].unique()) #get number of distinct values
Out[27]:
1
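
The retweet filter and its test can also be sketched with a single boolean mask. The toy frame below and the names `retweet_cols`, `is_original` and `originals` are assumptions for illustration. Only the three retweet column names come from the archive.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the archive: the row with tweet_id 2 is a retweet.
toy = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "retweeted_status_id": [np.nan, 7.0, np.nan],
        "retweeted_status_user_id": [np.nan, 42.0, np.nan],
        "retweeted_status_timestamp": [None, "2016-01-01 00:00:00 +0000", None],
    }
)

retweet_cols = [
    "retweeted_status_id",
    "retweeted_status_user_id",
    "retweeted_status_timestamp",
]

# A row is an original tweet when all three retweet fields are missing.
is_original = toy[retweet_cols].isna().all(axis=1)
originals = toy[is_original]

# Direct check: no retweet metadata is left in the filtered frame.
assert originals[retweet_cols].isna().all().all()
print(originals["tweet_id"].tolist())
```

Checking `isna()` directly is slightly stricter than counting distinct values, since a column holding one repeated non-null value would also have a single distinct value.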
  • Drop useless columns:

Code

In [28]:
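df_ta_usefullcols = df_ta.copy()
In [29]:
# Reply and retweet metadata columns are not needed for the analysis
drop_cols = ["in_reply_to_status_id","in_reply_to_user_id","retweeted_status_id","retweeted_status_user_id","retweeted_status_timestamp"]
# Note: this starts from df_ta, so retweet rows are still present in the result
df_ta_usefullcols = df_ta_usefullcols.drop(drop_cols, axis = 1)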
df_ta_usefullcols.T
Out[29]:
0 1 2 3 4 5 6 7 8 9 ... 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355
tweet_id 892420643555336193 892177421306343426 891815181378084864 891689557279858688 891327558926688256 891087950875897856 890971913173991426 890729181411237888 890609185150312448 890240255349198849 ... 666058600524156928 666057090499244032 666055525042405380 666051853826850816 666050758794694657 666049248165822465 666044226329800704 666033412701032449 666029285002620928 666020888022790149
timestamp 2017-08-01 16:23:56 +0000 2017-08-01 00:17:27 +0000 2017-07-31 00:18:03 +0000 2017-07-30 15:58:51 +0000 2017-07-29 16:00:24 +0000 2017-07-29 00:08:17 +0000 2017-07-28 16:27:12 +0000 2017-07-28 00:22:40 +0000 2017-07-27 16:25:51 +0000 2017-07-26 15:59:51 +0000 ... 2015-11-16 01:01:59 +0000 2015-11-16 00:55:59 +0000 2015-11-16 00:49:46 +0000 2015-11-16 00:35:11 +0000 2015-11-16 00:30:50 +0000 2015-11-16 00:24:50 +0000 2015-11-16 00:04:52 +0000 2015-11-15 23:21:54 +0000 2015-11-15 23:05:30 +0000 2015-11-15 22:32:08 +0000
source <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> ... <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
text This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\r\n\r\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A ... Here is the Rand Paul of retrievers folks! He's probably good at poker. Can drink beer (lol rad). 8/10 good dog https://t.co/pYAJkAe76p My oh my. This is a rare blond Canadian terrier on wheels. Only $8.98. Rather docile. 9/10 very rare https://t.co/yWBqbrzy8O Here is a Siberian heavily armored polar bear mix. Strong owner. 10/10 I would do unspeakable things to pet this dog https://t.co/rdivxLiqEt This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 
2/10 https://t.co/v5A4vzSDdc This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj
expanded_urls https://twitter.com/dog_rates/status/892420643555336193/photo/1 https://twitter.com/dog_rates/status/892177421306343426/photo/1 https://twitter.com/dog_rates/status/891815181378084864/photo/1 https://twitter.com/dog_rates/status/891689557279858688/photo/1 https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 https://twitter.com/dog_rates/status/891087950875897856/photo/1 https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1 https://twitter.com/dog_rates/status/890729181411237888/photo/1,https://twitter.com/dog_rates/status/890729181411237888/photo/1 https://twitter.com/dog_rates/status/890609185150312448/photo/1 https://twitter.com/dog_rates/status/890240255349198849/photo/1 ... https://twitter.com/dog_rates/status/666058600524156928/photo/1 https://twitter.com/dog_rates/status/666057090499244032/photo/1 https://twitter.com/dog_rates/status/666055525042405380/photo/1 https://twitter.com/dog_rates/status/666051853826850816/photo/1 https://twitter.com/dog_rates/status/666050758794694657/photo/1 https://twitter.com/dog_rates/status/666049248165822465/photo/1 https://twitter.com/dog_rates/status/666044226329800704/photo/1 https://twitter.com/dog_rates/status/666033412701032449/photo/1 https://twitter.com/dog_rates/status/666029285002620928/photo/1 https://twitter.com/dog_rates/status/666020888022790149/photo/1
rating_numerator 13 13 12 13 12 13 13 13 13 14 ... 8 9 10 2 10 5 6 9 7 8
rating_denominator 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10
name Phineas Tilly Archie Darla Franklin None Jax None Zoey Cassie ... the a a an a None a a a None
doggo None None None None None None None None None doggo ... None None None None None None None None None None
floofer None None None None None None None None None None ... None None None None None None None None None None
pupper None None None None None None None None None None ... None None None None None None None None None None
puppo None None None None None None None None None None ... None None None None None None None None None None

12 rows × 2356 columns

  • Get the dog_stage

    Recall that the dog stage is stored across multiple columns (doggo, floofer, pupper and puppo). This information should be encoded in a single column.

    Each of these columns holds either the stage name or the string 'None', much like a one-hot encoding, so we should reverse this encoding.

Test

In [30]:
set(df_ta.doggo)
Out[30]:
{'None', 'doggo'}
In [31]:
set(df_ta.floofer)
Out[31]:
{'None', 'floofer'}
In [32]:
set(df_ta.pupper)
Out[32]:
{'None', 'pupper'}
In [33]:
set(df_ta.puppo)
Out[33]:
{'None', 'puppo'}

Code

In [34]:
dog_stage = df_ta_usefullcols[["doggo","floofer","pupper","puppo"]].replace("None","").apply(lambda x: ''.join(x.astype(str)),axis=1)

df_ta_dstage = df_ta_usefullcols.copy()
df_ta_dstage["dog_stage"] = dog_stage

Test

In [35]:
set(dog_stage)
Out[35]:
{'',
 'doggo',
 'doggofloofer',
 'doggopupper',
 'doggopuppo',
 'floofer',
 'pupper',
 'puppo'}
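
The same collapse can be sketched on a toy frame. This variant joins the stage names with a hyphen, which is an assumption and not the notebook's approach, so that combined values are easier to read. The frame `toy` and the helper `combine_stages` are hypothetical.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Toy rows: one pupper, one doggo, and one tweet mentioning both stages.
toy = pd.DataFrame(
    {
        "doggo": ["None", "doggo", "doggo"],
        "floofer": ["None", "None", "None"],
        "pupper": ["pupper", "None", "pupper"],
        "puppo": ["None", "None", "None"],
    }
)

def combine_stages(row):
    """Join every stage that is set in the row, skipping the 'None' placeholder."""
    found = [value for value in row if value != "None"]
    return "-".join(found)

toy["dog_stage"] = toy[stage_cols].apply(combine_stages, axis=1)
print(toy["dog_stage"].tolist())
```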

There is an unexpected issue: the data source assigns some entries more than one dog_stage, because more than one stage word appears in the tweet text:

Test

In [36]:
df_ta_dstage[df_ta_dstage["dog_stage"] =="doggofloofer"]
Out[36]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo dog_stage
200 854010172552949760 2017-04-17 16:34:26 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk https://twitter.com/dog_rates/status/854010172552949760/photo/1,https://twitter.com/dog_rates/status/854010172552949760/photo/1 11 10 None doggo floofer None None doggofloofer
In [37]:
df_ta_dstage[df_ta_dstage["dog_stage"] =="doggopupper"]
Out[37]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo dog_stage
460 817777686764523521 2017-01-07 16:59:28 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Dido. She's playing the lead role in "Pupper Stops to Catch Snow Before Resuming Shadow Box with Dried Apple." 13/10 (IG: didodoggo) https://t.co/m7isZrOBX7 https://twitter.com/dog_rates/status/817777686764523521/video/1 13 10 Dido doggo None pupper None doggopupper
531 808106460588765185 2016-12-12 00:29:28 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time https://t.co/ANBpEYHaho https://twitter.com/dog_rates/status/808106460588765185/photo/1 12 10 None doggo None pupper None doggopupper
565 802265048156610565 2016-11-25 21:37:47 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Like doggo, like pupper version 2. Both 11/10 https://t.co/9IxWAXFqze https://twitter.com/dog_rates/status/802265048156610565/photo/1 11 10 None doggo None pupper None doggopupper
575 801115127852503040 2016-11-22 17:28:25 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55Dqe0SJNj https://twitter.com/dog_rates/status/801115127852503040/photo/1,https://twitter.com/dog_rates/status/801115127852503040/photo/1 12 10 Bones doggo None pupper None doggopupper
705 785639753186217984 2016-10-11 00:34:48 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously https://t.co/f2wmLZTPHd https://twitter.com/dog_rates/status/785639753186217984/photo/1,https://twitter.com/dog_rates/status/785639753186217984/photo/1 10 10 Pinot doggo None pupper None doggopupper
733 781308096455073793 2016-09-29 01:42:20 +0000 <a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a> Pupper butt 1, Doggo 0. Both 12/10 https://t.co/WQvcPEpH2u https://vine.co/v/5rgu2Law2ut 12 10 None doggo None pupper None doggopupper
778 775898661951791106 2016-09-14 03:27:11 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> RT @dog_rates: Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda https://twitter.com/dog_rates/status/733109485275860992/photo/1,https://twitter.com/dog_rates/status/733109485275860992/photo/1 12 10 None doggo None pupper None doggopupper
822 770093767776997377 2016-08-29 03:00:36 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> RT @dog_rates: This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC https://twitter.com/dog_rates/status/741067306818797568/photo/1,https://twitter.com/dog_rates/status/741067306818797568/photo/1 12 10 just doggo None pupper None doggopupper
889 759793422261743616 2016-07-31 16:50:42 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Meet Maggie &amp; Lila. Maggie is the doggo, Lila is the pupper. They are sisters. Both 12/10 would pet at the same time https://t.co/MYwR4DQKll https://twitter.com/dog_rates/status/759793422261743616/photo/1,https://twitter.com/dog_rates/status/759793422261743616/photo/1 12 10 Maggie doggo None pupper None doggopupper
956 751583847268179968 2016-07-09 01:08:47 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho https://t.co/u2c9c7qSg8 https://twitter.com/dog_rates/status/751583847268179968/photo/1 5 10 None doggo None pupper None doggopupper
1063 741067306818797568 2016-06-10 00:39:48 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC https://twitter.com/dog_rates/status/741067306818797568/photo/1 12 10 just doggo None pupper None doggopupper
1113 733109485275860992 2016-05-19 01:38:16 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda https://twitter.com/dog_rates/status/733109485275860992/photo/1 12 10 None doggo None pupper None doggopupper
In [38]:
df_ta_dstage[df_ta_dstage["dog_stage"] =="doggopuppo"]
Out[38]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo dog_stage
191 855851453814013952 2017-04-22 18:31:02 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel https://twitter.com/dog_rates/status/855851453814013952/photo/1 13 10 None doggo None None puppo doggopuppo

Since there are only 14 misclassifications (double classifications) in dog_stage, we can fix them manually by looking at the text.
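
A count like this can be obtained directly from the four stage columns instead of inspecting each combined label. The sketch below runs on a toy frame. The names `toy`, `stages_per_row`, `multi_stage` and `n_multi` are assumptions for illustration.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Toy rows: two of the four carry more than one stage.
toy = pd.DataFrame(
    {
        "doggo": ["doggo", "None", "doggo", "None"],
        "floofer": ["floofer", "None", "None", "None"],
        "pupper": ["None", "pupper", "pupper", "None"],
        "puppo": ["None", "None", "None", "None"],
    }
)

# Count how many stage columns are set in each row.
stages_per_row = (toy[stage_cols] != "None").sum(axis=1)

# Rows with more than one stage are the double classifications.
multi_stage = toy[stages_per_row > 1]
n_multi = len(multi_stage)
print(n_multi, multi_stage.index.tolist())
```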

  • fix doggofloofer:

Test

In [39]:
df_ta_dstage[df_ta_dstage["dog_stage"] =="doggofloofer"]
Out[39]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo dog_stage
200 854010172552949760 2017-04-17 16:34:26 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk https://twitter.com/dog_rates/status/854010172552949760/photo/1,https://twitter.com/dog_rates/status/854010172552949760/photo/1 11 10 None doggo floofer None None doggofloofer
In [40]:
df_ta_dstage.loc[df_ta_dstage.index == 200,'text'].iloc[0]
Out[40]:
"At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk"
In [41]:
df_ta_dstage.loc[df_ta_dstage.index == 200,'dog_stage'] = "floofer"
  • fix doggopuppo:

Code:

In [42]:
df_ta_dstage[df_ta_dstage["dog_stage"] =="doggopuppo"]
Out[42]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo dog_stage
191 855851453814013952 2017-04-22 18:31:02 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel https://twitter.com/dog_rates/status/855851453814013952/photo/1 13 10 None doggo None None puppo doggopuppo
In [43]:
df_ta_dstage.loc[df_ta_dstage.index == 191,'text'].iloc[0]
Out[43]:
"Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel"
In [44]:
df_ta_dstage.loc[df_ta_dstage.index == 191,'dog_stage'] = "puppo"
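
The two manual corrections above could also be kept together in a small lookup table, which makes the overrides easy to review. This is a sketch on a toy frame. The dictionary `manual_fixes` and the frame `toy` are hypothetical, while the index labels and target stages mirror the fixes above.

```python
import pandas as pd

# Toy frame reusing the index labels from the archive rows fixed above.
toy = pd.DataFrame(
    {"dog_stage": ["doggofloofer", "pupper", "doggopuppo"]},
    index=[200, 300, 191],
)

# Manual overrides decided after reading each tweet's text.
manual_fixes = {200: "floofer", 191: "puppo"}

for idx, stage in manual_fixes.items():
    # .loc with a scalar label updates exactly one row.
    toy.loc[idx, "dog_stage"] = stage

print(toy)
```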
  • fix doggopupper:

Code:

In [45]:
df_ta_dstage[df_ta_dstage["dog_stage"] =="doggopupper"]
Out[45]:
tweet_id timestamp source text expanded_urls rating_numerator rating_denominator name doggo floofer pupper puppo dog_stage
460 817777686764523521 2017-01-07 16:59:28 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Dido. She's playing the lead role in "Pupper Stops to Catch Snow Before Resuming Shadow Box with Dried Apple." 13/10 (IG: didodoggo) https://t.co/m7isZrOBX7 https://twitter.com/dog_rates/status/817777686764523521/video/1 13 10 Dido doggo None pupper None doggopupper
531 808106460588765185 2016-12-12 00:29:28 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time https://t.co/ANBpEYHaho https://twitter.com/dog_rates/status/808106460588765185/photo/1 12 10 None doggo None pupper None doggopupper
565 802265048156610565 2016-11-25 21:37:47 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Like doggo, like pupper version 2. Both 11/10 https://t.co/9IxWAXFqze https://twitter.com/dog_rates/status/802265048156610565/photo/1 11 10 None doggo None pupper None doggopupper
575 801115127852503040 2016-11-22 17:28:25 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55Dqe0SJNj https://twitter.com/dog_rates/status/801115127852503040/photo/1,https://twitter.com/dog_rates/status/801115127852503040/photo/1 12 10 Bones doggo None pupper None doggopupper
705 785639753186217984 2016-10-11 00:34:48 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously https://t.co/f2wmLZTPHd https://twitter.com/dog_rates/status/785639753186217984/photo/1,https://twitter.com/dog_rates/status/785639753186217984/photo/1 10 10 Pinot doggo None pupper None doggopupper
733 781308096455073793 2016-09-29 01:42:20 +0000 <a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a> Pupper butt 1, Doggo 0. Both 12/10 https://t.co/WQvcPEpH2u https://vine.co/v/5rgu2Law2ut 12 10 None doggo None pupper None doggopupper
778 775898661951791106 2016-09-14 03:27:11 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> RT @dog_rates: Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda https://twitter.com/dog_rates/status/733109485275860992/photo/1,https://twitter.com/dog_rates/status/733109485275860992/photo/1 12 10 None doggo None pupper None doggopupper
822 770093767776997377 2016-08-29 03:00:36 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> RT @dog_rates: This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC https://twitter.com/dog_rates/status/741067306818797568/photo/1,https://twitter.com/dog_rates/status/741067306818797568/photo/1 12 10 just doggo None pupper None doggopupper
889 759793422261743616 2016-07-31 16:50:42 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Meet Maggie &amp; Lila. Maggie is the doggo, Lila is the pupper. They are sisters. Both 12/10 would pet at the same time https://t.co/MYwR4DQKll https://twitter.com/dog_rates/status/759793422261743616/photo/1,https://twitter.com/dog_rates/status/759793422261743616/photo/1 12 10 Maggie doggo None pupper None doggopupper
956 751583847268179968 2016-07-09 01:08:47 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho https://t.co/u2c9c7qSg8 https://twitter.com/dog_rates/status/751583847268179968/photo/1 5 10 None doggo None pupper None doggopupper
1063 741067306818797568 2016-06-10 00:39:48 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC https://twitter.com/dog_rates/status/741067306818797568/photo/1 12 10 just doggo None pupper None doggopupper
1113 733109485275860992 2016-05-19 01:38:16 +0000 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda https://twitter.com/dog_rates/status/733109485275860992/photo/1 12 10 None doggo None pupper None doggopupper

Reading the tweet texts shows that some pictures contain both a pupper and a doggo:

Test:

In [46]:
df_ta_dstage.loc[[1113], "text"]
Out[46]:
1113    Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda
Name: text, dtype: object

We did not expect a single image to show dogs at more than one dog stage, but such images exist. We will relabel them as pupper_doggo.

However, other tweets mention both words while referring to only one dog, so they need to be relabeled with that dog's stage.

The changes will be:

  • [460, 575, 733] will be relabeled as pupper
  • [705, 956] will be relabeled as doggo
  • [531, 565, 778, 822, 889, 1063, 1113] will be relabeled as pupper_doggo

where the list elements are index labels of the df_ta_dstage dataframe:

Code:

In [47]:
# Tweets that mention both stages but describe a single pupper
df_ta_dstage.loc[[460, 575, 733], 'dog_stage'] = "pupper"

# Tweets that mention both stages but describe a single doggo
df_ta_dstage.loc[[705, 956], 'dog_stage'] = "doggo"

# Tweets whose picture shows both a pupper and a doggo
df_ta_dstage.loc[[531, 565, 778, 822, 889, 1063, 1113], 'dog_stage'] = "pupper_doggo"
In [48]:
df_ta_dstage.dog_stage = df_ta_dstage.dog_stage.replace("","Unknown")
In [49]:
df_ta_dstage = df_ta_dstage.drop(columns=["doggo", "floofer", "pupper", "puppo"])
df_ta_dstage.T
Out[49]:
0 1 2 3 4 5 6 7 8 9 ... 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355
tweet_id 892420643555336193 892177421306343426 891815181378084864 891689557279858688 891327558926688256 891087950875897856 890971913173991426 890729181411237888 890609185150312448 890240255349198849 ... 666058600524156928 666057090499244032 666055525042405380 666051853826850816 666050758794694657 666049248165822465 666044226329800704 666033412701032449 666029285002620928 666020888022790149
timestamp 2017-08-01 16:23:56 +0000 2017-08-01 00:17:27 +0000 2017-07-31 00:18:03 +0000 2017-07-30 15:58:51 +0000 2017-07-29 16:00:24 +0000 2017-07-29 00:08:17 +0000 2017-07-28 16:27:12 +0000 2017-07-28 00:22:40 +0000 2017-07-27 16:25:51 +0000 2017-07-26 15:59:51 +0000 ... 2015-11-16 01:01:59 +0000 2015-11-16 00:55:59 +0000 2015-11-16 00:49:46 +0000 2015-11-16 00:35:11 +0000 2015-11-16 00:30:50 +0000 2015-11-16 00:24:50 +0000 2015-11-16 00:04:52 +0000 2015-11-15 23:21:54 +0000 2015-11-15 23:05:30 +0000 2015-11-15 22:32:08 +0000
source <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> ... <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
text This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\r\n\r\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A ... Here is the Rand Paul of retrievers folks! He's probably good at poker. Can drink beer (lol rad). 8/10 good dog https://t.co/pYAJkAe76p My oh my. This is a rare blond Canadian terrier on wheels. Only $8.98. Rather docile. 9/10 very rare https://t.co/yWBqbrzy8O Here is a Siberian heavily armored polar bear mix. Strong owner. 10/10 I would do unspeakable things to pet this dog https://t.co/rdivxLiqEt This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 
2/10 https://t.co/v5A4vzSDdc This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj
expanded_urls https://twitter.com/dog_rates/status/892420643555336193/photo/1 https://twitter.com/dog_rates/status/892177421306343426/photo/1 https://twitter.com/dog_rates/status/891815181378084864/photo/1 https://twitter.com/dog_rates/status/891689557279858688/photo/1 https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 https://twitter.com/dog_rates/status/891087950875897856/photo/1 https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1 https://twitter.com/dog_rates/status/890729181411237888/photo/1,https://twitter.com/dog_rates/status/890729181411237888/photo/1 https://twitter.com/dog_rates/status/890609185150312448/photo/1 https://twitter.com/dog_rates/status/890240255349198849/photo/1 ... https://twitter.com/dog_rates/status/666058600524156928/photo/1 https://twitter.com/dog_rates/status/666057090499244032/photo/1 https://twitter.com/dog_rates/status/666055525042405380/photo/1 https://twitter.com/dog_rates/status/666051853826850816/photo/1 https://twitter.com/dog_rates/status/666050758794694657/photo/1 https://twitter.com/dog_rates/status/666049248165822465/photo/1 https://twitter.com/dog_rates/status/666044226329800704/photo/1 https://twitter.com/dog_rates/status/666033412701032449/photo/1 https://twitter.com/dog_rates/status/666029285002620928/photo/1 https://twitter.com/dog_rates/status/666020888022790149/photo/1
rating_numerator 13 13 12 13 12 13 13 13 13 14 ... 8 9 10 2 10 5 6 9 7 8
rating_denominator 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10
name Phineas Tilly Archie Darla Franklin None Jax None Zoey Cassie ... the a a an a None a a a None
dog_stage Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown doggo ... Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown

9 rows × 2356 columns
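
The relabeling and column cleanup above can be sketched on a toy frame. This is only an illustration, not the notebook's own cell: the frame `toy`, its index labels, and the `relabels` dictionary are hypothetical, and it assumes the index labels are unique so that `.loc` with a list of labels selects exactly those rows.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_ta_dstage (values are made up).
toy = pd.DataFrame(
    {
        "doggo": ["doggo", "doggo", "None"],
        "pupper": ["pupper", "pupper", "None"],
        "dog_stage": ["doggopupper", "doggopupper", ""],
    },
    index=[10, 20, 30],
)

# Map each corrected label to the index labels that should receive it.
relabels = {"pupper": [10], "pupper_doggo": [20]}
for stage, labels in relabels.items():
    # .loc with a list of index labels assigns to all listed rows at once.
    toy.loc[labels, "dog_stage"] = stage

# Rows with no stage at all get an explicit placeholder.
toy["dog_stage"] = toy["dog_stage"].replace("", "Unknown")

# The per-stage columns are now redundant, so drop them.
toy = toy.drop(columns=["doggo", "pupper"])

print(toy)
```

Keeping the corrections in one dictionary makes the manual decisions easy to review in a single place.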

  • Clean rating (divide the numerator by the denominator and express the result as a percentage):

Code:

In [50]:
df_ta_rating = df_ta_dstage.copy()
df_ta_rating["rating"] = 100*df_ta_rating.rating_numerator/df_ta_rating.rating_denominator
df_ta_rating = df_ta_rating.drop(["rating_numerator", "rating_denominator"], axis = 1)
df_ta_rating.T
Out[50]:
0 1 2 3 4 5 6 7 8 9 ... 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355
tweet_id 892420643555336193 892177421306343426 891815181378084864 891689557279858688 891327558926688256 891087950875897856 890971913173991426 890729181411237888 890609185150312448 890240255349198849 ... 666058600524156928 666057090499244032 666055525042405380 666051853826850816 666050758794694657 666049248165822465 666044226329800704 666033412701032449 666029285002620928 666020888022790149
timestamp 2017-08-01 16:23:56 +0000 2017-08-01 00:17:27 +0000 2017-07-31 00:18:03 +0000 2017-07-30 15:58:51 +0000 2017-07-29 16:00:24 +0000 2017-07-29 00:08:17 +0000 2017-07-28 16:27:12 +0000 2017-07-28 00:22:40 +0000 2017-07-27 16:25:51 +0000 2017-07-26 15:59:51 +0000 ... 2015-11-16 01:01:59 +0000 2015-11-16 00:55:59 +0000 2015-11-16 00:49:46 +0000 2015-11-16 00:35:11 +0000 2015-11-16 00:30:50 +0000 2015-11-16 00:24:50 +0000 2015-11-16 00:04:52 +0000 2015-11-15 23:21:54 +0000 2015-11-15 23:05:30 +0000 2015-11-15 22:32:08 +0000
source <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> ... <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
text This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\r\n\r\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A ... Here is the Rand Paul of retrievers folks! He's probably good at poker. Can drink beer (lol rad). 8/10 good dog https://t.co/pYAJkAe76p My oh my. This is a rare blond Canadian terrier on wheels. Only $8.98. Rather docile. 9/10 very rare https://t.co/yWBqbrzy8O Here is a Siberian heavily armored polar bear mix. Strong owner. 10/10 I would do unspeakable things to pet this dog https://t.co/rdivxLiqEt This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 
2/10 https://t.co/v5A4vzSDdc This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj
expanded_urls https://twitter.com/dog_rates/status/892420643555336193/photo/1 https://twitter.com/dog_rates/status/892177421306343426/photo/1 https://twitter.com/dog_rates/status/891815181378084864/photo/1 https://twitter.com/dog_rates/status/891689557279858688/photo/1 https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 https://twitter.com/dog_rates/status/891087950875897856/photo/1 https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1 https://twitter.com/dog_rates/status/890729181411237888/photo/1,https://twitter.com/dog_rates/status/890729181411237888/photo/1 https://twitter.com/dog_rates/status/890609185150312448/photo/1 https://twitter.com/dog_rates/status/890240255349198849/photo/1 ... https://twitter.com/dog_rates/status/666058600524156928/photo/1 https://twitter.com/dog_rates/status/666057090499244032/photo/1 https://twitter.com/dog_rates/status/666055525042405380/photo/1 https://twitter.com/dog_rates/status/666051853826850816/photo/1 https://twitter.com/dog_rates/status/666050758794694657/photo/1 https://twitter.com/dog_rates/status/666049248165822465/photo/1 https://twitter.com/dog_rates/status/666044226329800704/photo/1 https://twitter.com/dog_rates/status/666033412701032449/photo/1 https://twitter.com/dog_rates/status/666029285002620928/photo/1 https://twitter.com/dog_rates/status/666020888022790149/photo/1
name Phineas Tilly Archie Darla Franklin None Jax None Zoey Cassie ... the a a an a None a a a None
dog_stage Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown doggo ... Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
rating 130 130 120 130 120 130 130 130 130 140 ... 80 90 100 20 100 50 60 90 70 80

8 rows × 2356 columns
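
As a small worked example of the percentage computation, the sketch below uses made-up ratings. It also shows one possible way to handle a zero denominator, should any row have one: pandas division returns `inf` in that case rather than raising an error. Replacing `inf` with a missing value is an assumption made here for illustration, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings; the last row has a zero denominator on purpose.
toy = pd.DataFrame(
    {"rating_numerator": [13, 5, 24], "rating_denominator": [10, 10, 0]}
)

# Element-wise division; pandas yields inf (not an exception) for x / 0.
toy["rating"] = 100 * toy["rating_numerator"] / toy["rating_denominator"]

# Assumption: an infinite rating is not meaningful, so mark it as missing.
toy["rating"] = toy["rating"].replace([np.inf, -np.inf], np.nan)

# The two original columns are no longer needed once the ratio is stored.
toy = toy.drop(columns=["rating_numerator", "rating_denominator"])

print(toy)
```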

  • Simplify the source column (replace each HTML anchor with a short label):

Code:

In [51]:
df_ta_source = df_ta_rating.copy()
In [52]:
set(df_ta_source.source)
Out[52]:
{'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
 '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
 '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>'}
In [53]:
# Map each raw HTML anchor to a short, readable label
dict_source = {
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>': "web_client",
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>': "iphone",
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>': 'vine',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>': 'tweet deck',
}
In [54]:
df_ta_source.source = [dict_source[x] for x in df_ta_source.source]
In [55]:
df_ta_source.T
Out[55]:
0 1 2 3 4 5 6 7 8 9 ... 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355
tweet_id 892420643555336193 892177421306343426 891815181378084864 891689557279858688 891327558926688256 891087950875897856 890971913173991426 890729181411237888 890609185150312448 890240255349198849 ... 666058600524156928 666057090499244032 666055525042405380 666051853826850816 666050758794694657 666049248165822465 666044226329800704 666033412701032449 666029285002620928 666020888022790149
timestamp 2017-08-01 16:23:56 +0000 2017-08-01 00:17:27 +0000 2017-07-31 00:18:03 +0000 2017-07-30 15:58:51 +0000 2017-07-29 16:00:24 +0000 2017-07-29 00:08:17 +0000 2017-07-28 16:27:12 +0000 2017-07-28 00:22:40 +0000 2017-07-27 16:25:51 +0000 2017-07-26 15:59:51 +0000 ... 2015-11-16 01:01:59 +0000 2015-11-16 00:55:59 +0000 2015-11-16 00:49:46 +0000 2015-11-16 00:35:11 +0000 2015-11-16 00:30:50 +0000 2015-11-16 00:24:50 +0000 2015-11-16 00:04:52 +0000 2015-11-15 23:21:54 +0000 2015-11-15 23:05:30 +0000 2015-11-15 22:32:08 +0000
source iphone iphone iphone iphone iphone iphone iphone iphone iphone iphone ... iphone iphone iphone iphone iphone iphone iphone iphone iphone iphone
text This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\r\n\r\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A ... Here is the Rand Paul of retrievers folks! He's probably good at poker. Can drink beer (lol rad). 8/10 good dog https://t.co/pYAJkAe76p My oh my. This is a rare blond Canadian terrier on wheels. Only $8.98. Rather docile. 9/10 very rare https://t.co/yWBqbrzy8O Here is a Siberian heavily armored polar bear mix. Strong owner. 10/10 I would do unspeakable things to pet this dog https://t.co/rdivxLiqEt This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 
2/10 https://t.co/v5A4vzSDdc This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj
expanded_urls https://twitter.com/dog_rates/status/892420643555336193/photo/1 https://twitter.com/dog_rates/status/892177421306343426/photo/1 https://twitter.com/dog_rates/status/891815181378084864/photo/1 https://twitter.com/dog_rates/status/891689557279858688/photo/1 https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 https://twitter.com/dog_rates/status/891087950875897856/photo/1 https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1 https://twitter.com/dog_rates/status/890729181411237888/photo/1,https://twitter.com/dog_rates/status/890729181411237888/photo/1 https://twitter.com/dog_rates/status/890609185150312448/photo/1 https://twitter.com/dog_rates/status/890240255349198849/photo/1 ... https://twitter.com/dog_rates/status/666058600524156928/photo/1 https://twitter.com/dog_rates/status/666057090499244032/photo/1 https://twitter.com/dog_rates/status/666055525042405380/photo/1 https://twitter.com/dog_rates/status/666051853826850816/photo/1 https://twitter.com/dog_rates/status/666050758794694657/photo/1 https://twitter.com/dog_rates/status/666049248165822465/photo/1 https://twitter.com/dog_rates/status/666044226329800704/photo/1 https://twitter.com/dog_rates/status/666033412701032449/photo/1 https://twitter.com/dog_rates/status/666029285002620928/photo/1 https://twitter.com/dog_rates/status/666020888022790149/photo/1
name Phineas Tilly Archie Darla Franklin None Jax None Zoey Cassie ... the a a an a None a a a None
dog_stage Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown doggo ... Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
rating 130 130 120 130 120 130 130 130 130 140 ... 80 90 100 20 100 50 60 90 70 80

8 rows × 2356 columns
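
An alternative to a hand-written dictionary is to pull the visible link text out of each anchor with a regular expression. The sketch below is a possible variant, not the approach used above: the names `ANCHOR_TEXT` and `source_label` are hypothetical, and the result is the full anchor text (for example "Twitter for iPhone") rather than the short labels chosen in this notebook.

```python
import re

# Matches the visible text between <a ...> and </a>.
ANCHOR_TEXT = re.compile(r"<a\b[^>]*>(.*?)</a>")


def source_label(html):
    """Return the anchor text of an HTML link, or the input unchanged if no link is found."""
    match = ANCHOR_TEXT.search(html)
    return match.group(1) if match else html


sources = [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
]
labels = [source_label(s) for s in sources]
print(labels)
```

On a dataframe, the same function could be applied with `df_ta_source.source.map(source_label)`. A value that is not an anchor is left as it is, so an unexpected source would not raise an error.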

  • In the name column, values that do not start with a capital letter (such as "a", "an" or "the") are mislabeled names. We will replace those values with "Unknown".
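
As a rough illustration of this rule, the sketch below applies it to a few made-up values. The helper `clean_name` is hypothetical. It goes one step further than the rule as stated: a check on the first letter alone keeps the placeholder string "None", because it starts with a capital letter, so the sketch maps "None" to "Unknown" as well. That extra step, and the guard for empty strings, are assumptions made for this example.

```python
def clean_name(word):
    """Return the name if it looks like a proper name, otherwise "Unknown"."""
    # Assumption: the literal string "None" marks a missing name in this column.
    # The empty-string guard avoids an IndexError on word[0].
    if not word or word == "None":
        return "Unknown"
    # Lowercase-initial tokens such as "a", "an", "the" are extraction errors.
    return word if word[0].isupper() else "Unknown"


names = ["Phineas", "a", "None", "the", "BeBe", ""]
cleaned = [clean_name(n) for n in names]
print(cleaned)
```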

Code:

In [56]:
def map_lowercase_unknown(word):
    """Return word unchanged if it starts with an uppercase letter, else "Unknown"."""
    if word[0].isupper():
        return word
    return "Unknown"

df_ta_names = df_ta_source.copy()
df_ta_names["names"] = [map_lowercase_unknown(x) for x in df_ta_names.name]
set(df_ta_names.names)
Out[56]:
{'Abby',
 'Ace',
 'Acro',
 'Adele',
 'Aiden',
 'Aja',
 'Akumi',
 'Al',
 'Albert',
 'Albus',
 'Aldrick',
 'Alejandro',
 'Alexander',
 'Alexanderson',
 'Alf',
 'Alfie',
 'Alfy',
 'Alice',
 'Amber',
 'Ambrose',
 'Amy',
 'Amélie',
 'Anakin',
 'Andru',
 'Andy',
 'Angel',
 'Anna',
 'Anthony',
 'Antony',
 'Apollo',
 'Aqua',
 'Archie',
 'Arlen',
 'Arlo',
 'Arnie',
 'Arnold',
 'Arya',
 'Ash',
 'Asher',
 'Ashleigh',
 'Aspen',
 'Astrid',
 'Atlas',
 'Atticus',
 'Aubie',
 'Augie',
 'Autumn',
 'Ava',
 'Axel',
 'Bailey',
 'Baloo',
 'Balto',
 'Banditt',
 'Banjo',
 'Barclay',
 'Barney',
 'Baron',
 'Barry',
 'Batdog',
 'Bauer',
 'Baxter',
 'Bayley',
 'BeBe',
 'Bear',
 'Beau',
 'Beckham',
 'Beebop',
 'Beemo',
 'Bell',
 'Bella',
 'Belle',
 'Ben',
 'Benedict',
 'Benji',
 'Benny',
 'Bentley',
 'Berb',
 'Berkeley',
 'Bernie',
 'Bert',
 'Bertson',
 'Betty',
 'Beya',
 'Biden',
 'Bilbo',
 'Billl',
 'Billy',
 'Binky',
 'Birf',
 'Bisquick',
 'Blakely',
 'Blanket',
 'Blipson',
 'Blitz',
 'Bloo',
 'Bloop',
 'Blu',
 'Blue',
 'Bluebert',
 'Bo',
 'Bob',
 'Bobb',
 'Bobbay',
 'Bobble',
 'Bobby',
 'Bode',
 'Bodie',
 'Bonaparte',
 'Bones',
 'Bookstore',
 'Boomer',
 'Boots',
 'Boston',
 'Bowie',
 'Brad',
 'Bradlay',
 'Bradley',
 'Brady',
 'Brandi',
 'Brandonald',
 'Brandy',
 'Brat',
 'Brian',
 'Brockly',
 'Brody',
 'Bronte',
 'Brooks',
 'Brownie',
 'Bruce',
 'Brudge',
 'Bruiser',
 'Bruno',
 'Brutus',
 'Bubba',
 'Bubbles',
 'Buckley',
 'Buddah',
 'Buddy',
 'Bungalo',
 'Burt',
 'Butter',
 'Butters',
 'Cal',
 'Calbert',
 'Cali',
 'Callie',
 'Calvin',
 'Canela',
 'Cannon',
 'Carbon',
 'Carl',
 'Carll',
 'Carly',
 'Carper',
 'Carter',
 'Caryl',
 'Cash',
 'Cassie',
 'CeCe',
 'Cecil',
 'Cedrick',
 'Cermet',
 'Chadrick',
 'Champ',
 'Charl',
 'Charles',
 'Charleson',
 'Charlie',
 'Chase',
 'Chaz',
 'Cheesy',
 'Chef',
 'Chelsea',
 'Cheryl',
 'Chesney',
 'Chester',
 'Chesterson',
 'Chet',
 'Chevy',
 'Chip',
 'Chipson',
 'Chloe',
 'Chompsky',
 'Christoper',
 'Chubbs',
 'Chuck',
 'Chuckles',
 'Chuq',
 'Churlie',
 'Cilantro',
 'Clarence',
 'Clark',
 'Clarkus',
 'Clarq',
 'Claude',
 'Cleopatricia',
 'Clifford',
 'Clybe',
 'Clyde',
 'Coco',
 'Cody',
 'Colby',
 'Coleman',
 'Colin',
 'Combo',
 'Comet',
 'Cooper',
 'Coops',
 'Coopson',
 'Cora',
 'Corey',
 'Covach',
 'Craig',
 'Crawford',
 'Creg',
 'Crimson',
 'Crouton',
 'Crumpet',
 'Crystal',
 'Cuddles',
 'Cupcake',
 'Cupid',
 'Curtis',
 'Daisy',
 'Dakota',
 'Dale',
 'Dallas',
 'Damon',
 'Daniel',
 'Danny',
 'Dante',
 'Darby',
 'Darla',
 'Darrel',
 'Dash',
 'Dave',
 'Davey',
 'Dawn',
 'DayZ',
 'Deacon',
 'Derby',
 'Derek',
 'Devón',
 'Dewey',
 'Dex',
 'Dexter',
 'Dido',
 'Dietrich',
 'Diogi',
 'Divine',
 'Dixie',
 'Django',
 'Dobby',
 'Doc',
 'DonDon',
 'Donny',
 'Doobert',
 'Dook',
 'Dot',
 'Dotsy',
 'Doug',
 'Duchess',
 'Duddles',
 'Dudley',
 'Dug',
 'Duke',
 'Dunkin',
 'Durg',
 'Dutch',
 'Dwight',
 'Dylan',
 'Earl',
 'Eazy',
 'Ebby',
 'Ed',
 'Edd',
 'Edgar',
 'Edmund',
 'Eevee',
 'Einstein',
 'Eleanor',
 'Eli',
 'Ellie',
 'Elliot',
 'Emanuel',
 'Ember',
 'Emma',
 'Emmie',
 'Emmy',
 'Enchilada',
 'Erik',
 'Eriq',
 'Ester',
 'Eugene',
 'Eve',
 'Evy',
 'Fabio',
 'Farfle',
 'Ferg',
 'Fido',
 'Fiji',
 'Fillup',
 'Filup',
 'Finley',
 'Finn',
 'Finnegus',
 'Fiona',
 'Fizz',
 'Flash',
 'Fletcher',
 'Florence',
 'Flurpson',
 'Flávio',
 'Frank',
 'Frankie',
 'Franklin',
 'Franq',
 'Fred',
 'Freddery',
 'Frönq',
 'Furzey',
 'Fwed',
 'Fynn',
 'Gabby',
 'Gabe',
 'Gary',
 'General',
 'Genevieve',
 'Geno',
 'Geoff',
 'George',
 'Georgie',
 'Gerald',
 'Gerbald',
 'Gert',
 'Gidget',
 'Gilbert',
 'Gin',
 'Ginger',
 'Gizmo',
 'Glacier',
 'Glenn',
 'Godi',
 'Godzilla',
 'Goliath',
 'Goose',
 'Gordon',
 'Grady',
 'Grey',
 'Griffin',
 'Griswold',
 'Grizz',
 'Grizzie',
 'Grizzwald',
 'Gromit',
 'Gunner',
 'Gus',
 'Gustaf',
 'Gustav',
 'Gòrdón',
 'Hall',
 'Halo',
 'Hammond',
 'Hamrick',
 'Hank',
 'Hanz',
 'Happy',
 'Harlso',
 'Harnold',
 'Harold',
 'Harper',
 'Harrison',
 'Harry',
 'Harvey',
 'Hazel',
 'Hector',
 'Heinrich',
 'Henry',
 'Herald',
 'Herb',
 'Hercules',
 'Herm',
 'Hermione',
 'Hero',
 'Herschel',
 'Hobbes',
 'Holly',
 'Horace',
 'Howie',
 'Hubertson',
 'Huck',
 'Humphrey',
 'Hunter',
 'Hurley',
 'Huxley',
 'Iggy',
 'Ike',
 'Indie',
 'Iroh',
 'Ito',
 'Ivar',
 'Izzy',
 'JD',
 'Jack',
 'Jackie',
 'Jackson',
 'Jameson',
 'Jamesy',
 'Jangle',
 'Jareld',
 'Jarod',
 'Jarvis',
 'Jaspers',
 'Jax',
 'Jay',
 'Jaycob',
 'Jazz',
 'Jazzy',
 'Jeb',
 'Jebberson',
 'Jed',
 'Jeffrey',
 'Jeffri',
 'Jeffrie',
 'Jennifur',
 'Jeph',
 'Jeremy',
 'Jerome',
 'Jerry',
 'Jersey',
 'Jesse',
 'Jessifer',
 'Jessiga',
 'Jett',
 'Jim',
 'Jimbo',
 'Jiminus',
 'Jiminy',
 'Jimison',
 'Jimothy',
 'Jo',
 'Jockson',
 'Joey',
 'Jomathan',
 'Jonah',
 'Jordy',
 'Josep',
 'Joshwa',
 'Juckson',
 'Julio',
 'Julius',
 'Juno',
 'Kaia',
 'Kaiya',
 'Kallie',
 'Kane',
 'Kanu',
 'Kara',
 'Karl',
 'Karll',
 'Karma',
 'Kathmandu',
 'Katie',
 'Kawhi',
 'Kayla',
 'Keet',
 'Keith',
 'Kellogg',
 'Ken',
 'Kendall',
 'Kenneth',
 'Kenny',
 'Kenzie',
 'Keurig',
 'Kevin',
 'Kevon',
 'Kial',
 'Kilo',
 'Kingsley',
 'Kirby',
 'Kirk',
 'Klein',
 'Klevin',
 'Kloey',
 'Kobe',
 'Koda',
 'Kody',
 'Koko',
 'Kollin',
 'Kona',
 'Kota',
 'Kramer',
 'Kreg',
 'Kreggory',
 'Kulet',
 'Kuyu',
 'Kyle',
 'Kyro',
 'Lacy',
 'Laela',
 'Laika',
 'Lambeau',
 'Lance',
 'Larry',
 'Lassie',
 'Layla',
 'Leela',
 'Lennon',
 'Lenny',
 'Lenox',
 'Leo',
 'Leonard',
 'Leonidas',
 'Levi',
 'Liam',
 'Lilah',
 'Lili',
 'Lilli',
 'Lillie',
 'Lilly',
 'Lily',
 'Lincoln',
 'Linda',
 'Link',
 'Linus',
 'Lipton',
 'Livvie',
 'Lizzie',
 'Logan',
 'Loki',
 'Lola',
 'Lolo',
 'Longfellow',
 'Loomis',
 'Lorelei',
 'Lorenzo',
 'Lou',
 'Louie',
 'Louis',
 'Luca',
 'Lucia',
 'Lucky',
 'Lucy',
 'Lugan',
 'Lulu',
 'Luna',
 'Lupe',
 'Luther',
 'Mabel',
 'Mac',
 'Mack',
 'Maddie',
 'Maggie',
 'Mairi',
 'Maisey',
 'Major',
 'Maks',
 'Malcolm',
 'Malikai',
 'Margo',
 'Mark',
 'Marlee',
 'Marley',
 'Marq',
 'Marty',
 'Marvin',
 'Mary',
 'Mason',
 'Mattie',
 'Maude',
 'Mauve',
 'Max',
 'Maxaroni',
 'Maximus',
 'Maxwell',
 'Maya',
 'Meatball',
 'Meera',
 'Meyer',
 'Mia',
 'Michelangelope',
 'Miguel',
 'Mike',
 'Miley',
 'Milky',
 'Millie',
 'Milo',
 'Mimosa',
 'Mingus',
 'Mister',
 'Misty',
 'Mitch',
 'Mo',
 'Moe',
 'Mojo',
 'Mollie',
 'Molly',
 'Mona',
 'Monkey',
 'Monster',
 'Monty',
 'Moofasa',
 'Mookie',
 'Moose',
 'Moreton',
 'Mosby',
 'Murphy',
 'Mutt',
 'Mya',
 'Nala',
 'Naphaniel',
 'Napolean',
 'Nelly',
 'Neptune',
 'Newt',
 'Nico',
 'Nida',
 'Nigel',
 'Nimbus',
 'Noah',
 'Nollie',
 'None',
 'Noosh',
 'Norman',
 'Nugget',
 'O',
 'Oakley',
 'Obi',
 'Obie',
 'Oddie',
 'Odie',
 'Odin',
 'Olaf',
 'Ole',
 'Olive',
 'Oliver',
 'Olivia',
 'Oliviér',
 'Ollie',
 'Opal',
 'Opie',
 'Oreo',
 'Orion',
 'Oscar',
 'Oshie',
 'Otis',
 'Ozzie',
 'Ozzy',
 'Pablo',
 'Paisley',
 'Pancake',
 'Panda',
 'Patch',
 'Patrick',
 'Paull',
 'Pavlov',
 'Pawnd',
 'Peaches',
 'Peanut',
 'Penelope',
 'Penny',
 'Pepper',
 'Percy',
 'Perry',
 'Pete',
 'Petrick',
 'Pherb',
 'Phil',
 'Philbert',
 'Philippe',
 'Phineas',
 'Phred',
 'Pickles',
 'Pilot',
 'Pinot',
 'Pip',
 'Piper',
 'Pippa',
 'Pippin',
 'Pipsy',
 'Pluto',
 'Poppy',
 'Pubert',
 'Puff',
 'Pumpkin',
 'Pupcasso',
 'Quinn',
 'Ralf',
 'Ralph',
 'Ralpher',
 'Ralphie',
 'Ralphson',
 'Ralphus',
 'Ralphy',
 'Ralphé',
 'Rambo',
 'Randall',
 'Raphael',
 'Rascal',
 'Raymond',
 'Reagan',
 'Reese',
 'Reggie',
 'Reginald',
 'Remington',
 'Remus',
 'Remy',
 'Reptar',
 'Rey',
 'Rhino',
 'Richie',
 'Ricky',
 'Ridley',
 'Riley',
 'Rilo',
 'Rinna',
 'River',
 'Rizzo',
 'Rizzy',
 'Robin',
 'Rocco',
 'Rocky',
 'Rodman',
 'Rodney',
 'Rolf',
 'Romeo',
 'Ron',
 'Ronduh',
 'Ronnie',
 'Rontu',
 'Rooney',
 'Roosevelt',
 'Rorie',
 'Rory',
 'Roscoe',
 'Rose',
 'Rosie',
 'Rover',
 'Rubio',
 'Ruby',
 'Rudy',
 'Rueben',
 'Ruffles',
 'Rufio',
 'Rufus',
 'Rumble',
 'Rumpole',
 'Rupert',
 'Rusty',
 'Sadie',
 'Sage',
 'Sailer',
 'Sailor',
 'Sam',
 'Sammy',
 'Sampson',
 'Samsom',
 'Samson',
 'Sandra',
 'Sandy',
 'Sansa',
 'Sarge',
 'Saydee',
 'Schnitzel',
 'Schnozz',
 'Scooter',
 'Scott',
 'Scout',
 'Scruffers',
 'Seamus',
 'Sebastian',
 'Sephie',
 'Severus',
 'Shadoe',
 'Shadow',
 'Shaggy',
 'Shakespeare',
 'Shawwn',
 'Shelby',
 'Shikha',
 'Shiloh',
 'Shnuggles',
 'Shooter',
 'Siba',
 'Sid',
 'Sierra',
 'Simba',
 'Skittle',
 'Skittles',
 'Sky',
 'Skye',
 'Smiley',
 'Smokey',
 'Snickers',
 'Snicku',
 'Snoop',
 'Snoopy',
 'Sobe',
 'Socks',
 'Sojourner',
 'Solomon',
 'Sonny',
 'Sophie',
 'Sora',
 'Spanky',
 'Spark',
 'Sparky',
 'Spencer',
 'Sprinkles',
 'Sprout',
 'Staniel',
 'Stanley',
 'Stark',
 'Stefan',
 'Stella',
 'Stephan',
 'Stephanus',
 'Steve',
 'Steven',
 'Stewie',
 'Storkson',
 'Stormy',
 'Strider',
 'Striker',
 'Strudel',
 'Stu',
 'Stuart',
 'Stubert',
 'Sugar',
 'Suki',
 'Sully',
 'Sundance',
 'Sunny',
 'Sunshine',
 'Superpup',
 'Swagger',
 'Sweet',
 'Sweets',
 'Taco',
 'Tango',
 'Tanner',
 'Tassy',
 'Tater',
 'Tayzie',
 'Taz',
 'Tebow',
 'Ted',
 'Tedders',
 'Teddy',
 'Tedrick',
 'Terrance',
 'Terrenth',
 'Terry',
 'Tess',
 'Tessa',
 'Theo',
 'Theodore',
 'Thor',
 'Thumas',
 'Tiger',
 'Tilly',
 'Timber',
 'Timison',
 'Timmy',
 'Timofy',
 'Tino',
 'Titan',
 'Tito',
 'Tobi',
 'Toby',
 'Todo',
 'Toffee',
 'Tom',
 'Tommy',
 'Tonks',
 'Torque',
 'Tove',
 'Travis',
 'Traviss',
 'Trevith',
 'Trigger',
 'Trip',
 'Tripp',
 'Trooper',
 'Tuck',
 'Tucker',
 'Tuco',
 'Tug',
 'Tupawc',
 'Tycho',
 'Tyr',
 'Tyrone',
 'Tyrus',
 'Ulysses',
 'Unknown',
 'Venti',
 'Vince',
 'Vincent',
 'Vinnie',
 'Vinscent',
 'Vixen',
 'Wafer',
 'Waffles',
 'Walker',
 'Wallace',
 'Wally',
 'Walter',
 'Watson',
 'Wesley',
 'Wiggles',
 'Willem',
 'William',
 'Willie',
 'Willow',
 'Willy',
 'Wilson',
 'Winifred',
 'Winnie',
 'Winston',
 'Wishes',
 'Wyatt',
 'Yoda',
 'Yogi',
 'Yukon',
 'Zara',
 'Zeek',
 'Zeke',
 'Zeus',
 'Ziva',
 'Zoe',
 'Zoey',
 'Zooey',
 'Zuzu'}
In [57]:
df_ta_names.T
Out[57]:
0 1 2 3 4 5 6 7 8 9 ... 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355
tweet_id 892420643555336193 892177421306343426 891815181378084864 891689557279858688 891327558926688256 891087950875897856 890971913173991426 890729181411237888 890609185150312448 890240255349198849 ... 666058600524156928 666057090499244032 666055525042405380 666051853826850816 666050758794694657 666049248165822465 666044226329800704 666033412701032449 666029285002620928 666020888022790149
timestamp 2017-08-01 16:23:56 +0000 2017-08-01 00:17:27 +0000 2017-07-31 00:18:03 +0000 2017-07-30 15:58:51 +0000 2017-07-29 16:00:24 +0000 2017-07-29 00:08:17 +0000 2017-07-28 16:27:12 +0000 2017-07-28 00:22:40 +0000 2017-07-27 16:25:51 +0000 2017-07-26 15:59:51 +0000 ... 2015-11-16 01:01:59 +0000 2015-11-16 00:55:59 +0000 2015-11-16 00:49:46 +0000 2015-11-16 00:35:11 +0000 2015-11-16 00:30:50 +0000 2015-11-16 00:24:50 +0000 2015-11-16 00:04:52 +0000 2015-11-15 23:21:54 +0000 2015-11-15 23:05:30 +0000 2015-11-15 22:32:08 +0000
source iphone iphone iphone iphone iphone iphone iphone iphone iphone iphone ... iphone iphone iphone iphone iphone iphone iphone iphone iphone iphone
text This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\r\n\r\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A ... Here is the Rand Paul of retrievers folks! He's probably good at poker. Can drink beer (lol rad). 8/10 good dog https://t.co/pYAJkAe76p My oh my. This is a rare blond Canadian terrier on wheels. Only $8.98. Rather docile. 9/10 very rare https://t.co/yWBqbrzy8O Here is a Siberian heavily armored polar bear mix. Strong owner. 10/10 I would do unspeakable things to pet this dog https://t.co/rdivxLiqEt This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 
2/10 https://t.co/v5A4vzSDdc This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj
expanded_urls https://twitter.com/dog_rates/status/892420643555336193/photo/1 https://twitter.com/dog_rates/status/892177421306343426/photo/1 https://twitter.com/dog_rates/status/891815181378084864/photo/1 https://twitter.com/dog_rates/status/891689557279858688/photo/1 https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1 https://twitter.com/dog_rates/status/891087950875897856/photo/1 https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1 https://twitter.com/dog_rates/status/890729181411237888/photo/1,https://twitter.com/dog_rates/status/890729181411237888/photo/1 https://twitter.com/dog_rates/status/890609185150312448/photo/1 https://twitter.com/dog_rates/status/890240255349198849/photo/1 ... https://twitter.com/dog_rates/status/666058600524156928/photo/1 https://twitter.com/dog_rates/status/666057090499244032/photo/1 https://twitter.com/dog_rates/status/666055525042405380/photo/1 https://twitter.com/dog_rates/status/666051853826850816/photo/1 https://twitter.com/dog_rates/status/666050758794694657/photo/1 https://twitter.com/dog_rates/status/666049248165822465/photo/1 https://twitter.com/dog_rates/status/666044226329800704/photo/1 https://twitter.com/dog_rates/status/666033412701032449/photo/1 https://twitter.com/dog_rates/status/666029285002620928/photo/1 https://twitter.com/dog_rates/status/666020888022790149/photo/1
name Phineas Tilly Archie Darla Franklin None Jax None Zoey Cassie ... the a a an a None a a a None
dog_stage Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown doggo ... Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
rating 130 130 120 130 120 130 130 130 130 140 ... 80 90 100 20 100 50 60 90 70 80
names Phineas Tilly Archie Darla Franklin None Jax None Zoey Cassie ... Unknown Unknown Unknown Unknown Unknown None Unknown Unknown Unknown None

9 rows × 2356 columns

  • The expanded_urls column holds more than one value for tweets with more than one image, and those values are repeated and separated by commas.

    I will take a simplified approach and keep a single URL for each tweet (the first element of the list, usually the first photo).

Code:

In [58]:
df_ta_urls = df_ta_names.copy()
df_ta_urls.expanded_urls = [str(x).split(",")[0] for x in df_ta_names.expanded_urls]
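One detail worth checking in the cell above: str(x) turns a missing value into the literal string 'nan'. The sketch below, on a small made-up frame (toy and its URLs are hypothetical, only the column name comes from the notebook), compares that behaviour with the pandas string accessor, which leaves missing values as NaN. It is an illustration under those assumptions, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_ta_names; only the column name matches the notebook.
toy = pd.DataFrame({
    "expanded_urls": [
        "https://example.com/a/photo/1,https://example.com/a/photo/1",
        "https://example.com/b/photo/1",
        np.nan,  # a tweet without any URL
    ]
})

# Vectorized version: split on commas and keep the first piece; NaN stays NaN.
toy["first_url"] = toy["expanded_urls"].str.split(",").str[0]

# List-comprehension version, as in the notebook: NaN becomes the string 'nan'.
toy["first_url_str"] = [str(x).split(",")[0] for x in toy["expanded_urls"]]

print(toy[["first_url", "first_url_str"]])
```

Which behaviour is preferable depends on whether later steps expect a real missing value or a string in every row.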
In [59]:
df_ta_urls.T
Out[59]:
0 1 2 3 4 5 6 7 8 9 ... 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355
tweet_id 892420643555336193 892177421306343426 891815181378084864 891689557279858688 891327558926688256 891087950875897856 890971913173991426 890729181411237888 890609185150312448 890240255349198849 ... 666058600524156928 666057090499244032 666055525042405380 666051853826850816 666050758794694657 666049248165822465 666044226329800704 666033412701032449 666029285002620928 666020888022790149
timestamp 2017-08-01 16:23:56 +0000 2017-08-01 00:17:27 +0000 2017-07-31 00:18:03 +0000 2017-07-30 15:58:51 +0000 2017-07-29 16:00:24 +0000 2017-07-29 00:08:17 +0000 2017-07-28 16:27:12 +0000 2017-07-28 00:22:40 +0000 2017-07-27 16:25:51 +0000 2017-07-26 15:59:51 +0000 ... 2015-11-16 01:01:59 +0000 2015-11-16 00:55:59 +0000 2015-11-16 00:49:46 +0000 2015-11-16 00:35:11 +0000 2015-11-16 00:30:50 +0000 2015-11-16 00:24:50 +0000 2015-11-16 00:04:52 +0000 2015-11-15 23:21:54 +0000 2015-11-15 23:05:30 +0000 2015-11-15 22:32:08 +0000
source iphone iphone iphone iphone iphone iphone iphone iphone iphone iphone ... iphone iphone iphone iphone iphone iphone iphone iphone iphone iphone
text This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\r\n\r\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A ... Here is the Rand Paul of retrievers folks! He's probably good at poker. Can drink beer (lol rad). 8/10 good dog https://t.co/pYAJkAe76p My oh my. This is a rare blond Canadian terrier on wheels. Only $8.98. Rather docile. 9/10 very rare https://t.co/yWBqbrzy8O Here is a Siberian heavily armored polar bear mix. Strong owner. 10/10 I would do unspeakable things to pet this dog https://t.co/rdivxLiqEt This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 
2/10 https://t.co/v5A4vzSDdc This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj
expanded_urls https://twitter.com/dog_rates/status/892420643555336193/photo/1 https://twitter.com/dog_rates/status/892177421306343426/photo/1 https://twitter.com/dog_rates/status/891815181378084864/photo/1 https://twitter.com/dog_rates/status/891689557279858688/photo/1 https://twitter.com/dog_rates/status/891327558926688256/photo/1 https://twitter.com/dog_rates/status/891087950875897856/photo/1 https://gofundme.com/ydvmve-surgery-for-jax https://twitter.com/dog_rates/status/890729181411237888/photo/1 https://twitter.com/dog_rates/status/890609185150312448/photo/1 https://twitter.com/dog_rates/status/890240255349198849/photo/1 ... https://twitter.com/dog_rates/status/666058600524156928/photo/1 https://twitter.com/dog_rates/status/666057090499244032/photo/1 https://twitter.com/dog_rates/status/666055525042405380/photo/1 https://twitter.com/dog_rates/status/666051853826850816/photo/1 https://twitter.com/dog_rates/status/666050758794694657/photo/1 https://twitter.com/dog_rates/status/666049248165822465/photo/1 https://twitter.com/dog_rates/status/666044226329800704/photo/1 https://twitter.com/dog_rates/status/666033412701032449/photo/1 https://twitter.com/dog_rates/status/666029285002620928/photo/1 https://twitter.com/dog_rates/status/666020888022790149/photo/1
name Phineas Tilly Archie Darla Franklin None Jax None Zoey Cassie ... the a a an a None a a a None
dog_stage Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown doggo ... Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
rating 130 130 120 130 120 130 130 130 130 140 ... 80 90 100 20 100 50 60 90 70 80
names Phineas Tilly Archie Darla Franklin None Jax None Zoey Cassie ... Unknown Unknown Unknown Unknown Unknown None Unknown Unknown Unknown None

9 rows × 2356 columns

  • The timestamp column contains very valuable information; however, it can be hard to consume. We would like to get:
    • The hour (only the hour, not min, not second, in a single column)
    • The year
    • The month/year (calmonth)
    • The day/month/year (calday)
    • The day of the week

Code:

In [60]:
df_ta_time = df_ta_urls.copy()
In [61]:
df_ta_time.timestamp
Out[61]:
0       2017-08-01 16:23:56 +0000
1       2017-08-01 00:17:27 +0000
2       2017-07-31 00:18:03 +0000
3       2017-07-30 15:58:51 +0000
4       2017-07-29 16:00:24 +0000
                  ...            
2351    2015-11-16 00:24:50 +0000
2352    2015-11-16 00:04:52 +0000
2353    2015-11-15 23:21:54 +0000
2354    2015-11-15 23:05:30 +0000
2355    2015-11-15 22:32:08 +0000
Name: timestamp, Length: 2356, dtype: object
In [62]:
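# Split "YYYY-MM-DD HH:MM:SS +0000" into its date and time parts.
# The date column plays the role of calday (year, month and day together).
df_ta_time["date"] = [x.split(" ")[0] for x in df_ta_time.timestamp]
df_ta_time["time"] = [x.split(" ")[1] for x in df_ta_time.timestamp]
df_ta_time["hour"] = [int(x[:2]) for x in df_ta_time.time]

# Day, month and year as integers, sliced from the ISO-formatted date string
df_ta_time["day"] = [int(x[8:10]) for x in df_ta_time.date]
df_ta_time["month"] = [int(x[5:7]) for x in df_ta_time.date]
df_ta_time["year"] = [int(x[:4]) for x in df_ta_time.date]

# calmonth is "month-year", e.g. "8-2017"
df_ta_time["calmonth"] = df_ta_time["month"].map(str) + "-" + df_ta_time["year"].map(str)


# Weekday name; requires `import calendar` and `from datetime import date`
def get_day_name(row):
    return calendar.day_name[date(row["year"], row["month"], row["day"]).weekday()]

df_ta_time["day_of_week"] = df_ta_time.apply(get_day_name, axis=1)

df_ta_time.T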
Out[62]:
0 1 2 3 4 5 6 7 8 9 ... 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355
tweet_id 892420643555336193 892177421306343426 891815181378084864 891689557279858688 891327558926688256 891087950875897856 890971913173991426 890729181411237888 890609185150312448 890240255349198849 ... 666058600524156928 666057090499244032 666055525042405380 666051853826850816 666050758794694657 666049248165822465 666044226329800704 666033412701032449 666029285002620928 666020888022790149
timestamp 2017-08-01 16:23:56 +0000 2017-08-01 00:17:27 +0000 2017-07-31 00:18:03 +0000 2017-07-30 15:58:51 +0000 2017-07-29 16:00:24 +0000 2017-07-29 00:08:17 +0000 2017-07-28 16:27:12 +0000 2017-07-28 00:22:40 +0000 2017-07-27 16:25:51 +0000 2017-07-26 15:59:51 +0000 ... 2015-11-16 01:01:59 +0000 2015-11-16 00:55:59 +0000 2015-11-16 00:49:46 +0000 2015-11-16 00:35:11 +0000 2015-11-16 00:30:50 +0000 2015-11-16 00:24:50 +0000 2015-11-16 00:04:52 +0000 2015-11-15 23:21:54 +0000 2015-11-15 23:05:30 +0000 2015-11-15 22:32:08 +0000
source iphone iphone iphone iphone iphone iphone iphone iphone iphone iphone ... iphone iphone iphone iphone iphone iphone iphone iphone iphone iphone
text This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\r\n\r\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A ... Here is the Rand Paul of retrievers folks! He's probably good at poker. Can drink beer (lol rad). 8/10 good dog https://t.co/pYAJkAe76p My oh my. This is a rare blond Canadian terrier on wheels. Only $8.98. Rather docile. 9/10 very rare https://t.co/yWBqbrzy8O Here is a Siberian heavily armored polar bear mix. Strong owner. 10/10 I would do unspeakable things to pet this dog https://t.co/rdivxLiqEt This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 
2/10 https://t.co/v5A4vzSDdc This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj
expanded_urls https://twitter.com/dog_rates/status/892420643555336193/photo/1 https://twitter.com/dog_rates/status/892177421306343426/photo/1 https://twitter.com/dog_rates/status/891815181378084864/photo/1 https://twitter.com/dog_rates/status/891689557279858688/photo/1 https://twitter.com/dog_rates/status/891327558926688256/photo/1 https://twitter.com/dog_rates/status/891087950875897856/photo/1 https://gofundme.com/ydvmve-surgery-for-jax https://twitter.com/dog_rates/status/890729181411237888/photo/1 https://twitter.com/dog_rates/status/890609185150312448/photo/1 https://twitter.com/dog_rates/status/890240255349198849/photo/1 ... https://twitter.com/dog_rates/status/666058600524156928/photo/1 https://twitter.com/dog_rates/status/666057090499244032/photo/1 https://twitter.com/dog_rates/status/666055525042405380/photo/1 https://twitter.com/dog_rates/status/666051853826850816/photo/1 https://twitter.com/dog_rates/status/666050758794694657/photo/1 https://twitter.com/dog_rates/status/666049248165822465/photo/1 https://twitter.com/dog_rates/status/666044226329800704/photo/1 https://twitter.com/dog_rates/status/666033412701032449/photo/1 https://twitter.com/dog_rates/status/666029285002620928/photo/1 https://twitter.com/dog_rates/status/666020888022790149/photo/1
name Phineas Tilly Archie Darla Franklin None Jax None Zoey Cassie ... the a a an a None a a a None
dog_stage Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown doggo ... Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
rating 130 130 120 130 120 130 130 130 130 140 ... 80 90 100 20 100 50 60 90 70 80
names Phineas Tilly Archie Darla Franklin None Jax None Zoey Cassie ... Unknown Unknown Unknown Unknown Unknown None Unknown Unknown Unknown None
date 2017-08-01 2017-08-01 2017-07-31 2017-07-30 2017-07-29 2017-07-29 2017-07-28 2017-07-28 2017-07-27 2017-07-26 ... 2015-11-16 2015-11-16 2015-11-16 2015-11-16 2015-11-16 2015-11-16 2015-11-16 2015-11-15 2015-11-15 2015-11-15
time 16:23:56 00:17:27 00:18:03 15:58:51 16:00:24 00:08:17 16:27:12 00:22:40 16:25:51 15:59:51 ... 01:01:59 00:55:59 00:49:46 00:35:11 00:30:50 00:24:50 00:04:52 23:21:54 23:05:30 22:32:08
hour 16 0 0 15 16 0 16 0 16 15 ... 1 0 0 0 0 0 0 23 23 22
day 1 1 31 30 29 29 28 28 27 26 ... 16 16 16 16 16 16 16 15 15 15
month 8 8 7 7 7 7 7 7 7 7 ... 11 11 11 11 11 11 11 11 11 11
year 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ... 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015
calmonth 8-2017 8-2017 7-2017 7-2017 7-2017 7-2017 7-2017 7-2017 7-2017 7-2017 ... 11-2015 11-2015 11-2015 11-2015 11-2015 11-2015 11-2015 11-2015 11-2015 11-2015
day_of_week Tuesday Tuesday Monday Sunday Saturday Saturday Friday Friday Thursday Wednesday ... Monday Monday Monday Monday Monday Monday Monday Sunday Sunday Sunday

17 rows × 2356 columns
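As an alternative to slicing strings, the same fields can probably be derived by parsing the column once with pd.to_datetime and using the .dt accessor. The sketch below uses a two-row made-up frame (toy is hypothetical; the timestamp strings follow the format shown above). Note that it writes calmonth as year-month ("2017-08") so that values sort chronologically, which differs from the "8-2017" format used in this notebook.

```python
import pandas as pd

# Hypothetical two-row sample in the same format as the archive's timestamp column.
toy = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2015-11-15 22:32:08 +0000"]
})

# Parse once into timezone-aware datetimes.
ts = pd.to_datetime(toy["timestamp"], format="%Y-%m-%d %H:%M:%S %z")

toy["hour"] = ts.dt.hour                      # hour only, as an integer
toy["year"] = ts.dt.year
toy["calmonth"] = ts.dt.strftime("%Y-%m")     # year-month (assumed format, sorts chronologically)
toy["calday"] = ts.dt.strftime("%Y-%m-%d")    # year-month-day
toy["day_of_week"] = ts.dt.day_name()         # e.g. "Tuesday"

print(toy.T)
```

Parsing first also means malformed timestamps raise an error immediately instead of producing wrong slices silently.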

The cleaning process for the Twitter archive file is done.

In [63]:
df_ta_fullclean = df_ta_time.copy() 

2.2.3 Assessing Twitter API data

Test:

In [64]:
df_api.head(5).T
Out[64]:
id 892177421306343426 891689557279858688 891087950875897856 890729181411237888 890240255349198849
created_at Tue Aug 01 00:17:27 +0000 2017 Sun Jul 30 15:58:51 +0000 2017 Sat Jul 29 00:08:17 +0000 2017 Fri Jul 28 00:22:40 +0000 2017 Wed Jul 26 15:59:51 +0000 2017
id_str 892177421306343426 891689557279858688 891087950875897856 890729181411237888 890240255349198849
text This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boop… https://t.co/aQFSeaCu9L This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG:… https://t.co/xx5cilW0Dd When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13… https://t.co/hrcFOGi12V This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant… https://t.co/l3TSS3o2M0
truncated True False True True True
source <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a> <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
user WeRateDogs® WeRateDogs® WeRateDogs® WeRateDogs® WeRateDogs®
retweet_count 5531 7618 2752 16647 6454
favorite_count 30525 38575 18580 59463 29174
lang en en en en en

Through the API we have collected more data than we actually need, and part of it duplicates information we already have from the Twitter archive.

The columns id_str, text and source are redundant and hence we should drop them.

The user column is categorical with a single unique value in the data, so it is not informative. The lang column takes only a handful of values (see the test below) and is not needed for this analysis, so we will drop both.

Finally truncated offers new information but is not really relevant, so it will also be dropped.

In conclusion we will keep the columns:

  • id
  • retweet_count
  • favorite_count

Those columns do not contain NaNs.

Test:

In [65]:
set(df_api.lang)
Out[65]:
{'en', 'es', 'et', 'nl', 'ro', 'und'}
In [66]:
set(df_api.user)
Out[66]:
{'WeRateDogs®'}
In [67]:
set(df_api.favorite_count.isnull())
Out[67]:
{False}
In [68]:
set(df_api.retweet_count.isnull())
Out[68]:
{False}
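The four checks above can also be run for every column at once. The following is a minimal sketch on a made-up frame (toy_api and its values are hypothetical; the column names mirror df_api), counting distinct values and missing values per column.

```python
import pandas as pd

# Hypothetical miniature of df_api; values are invented for illustration.
toy_api = pd.DataFrame({
    "id": [1, 2, 3],
    "lang": ["en", "en", "nl"],
    "user": ["WeRateDogs®"] * 3,
    "retweet_count": [10, 20, 30],
    "favorite_count": [100, 200, 300],
})

# Number of distinct values per column; a count of 1 means the column is constant.
unique_counts = toy_api.nunique()

# Number of missing values per column.
null_counts = toy_api.isnull().sum()

# Columns that carry no information because every row has the same value.
constant_columns = unique_counts[unique_counts == 1].index.tolist()

print(unique_counts, null_counts, constant_columns, sep="\n\n")
```

On this toy frame only user is constant, which matches what the set(...) checks show for the real data.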

2.2.4 Cleaning Twitter API data

Code:

In [69]:
df_api_fullclean = df_api.reset_index()[["id", "retweet_count", "favorite_count"]].copy()
In [70]:
df_api_fullclean
Out[70]:
id retweet_count favorite_count
0 892177421306343426 5531 30525
1 891689557279858688 7618 38575
2 891087950875897856 2752 18580
3 890729181411237888 16647 59463
4 890240255349198849 6454 29174
... ... ... ...
1160 666058600524156928 51 103
1161 666055525042405380 213 402
1162 666050758794694657 51 119
1163 666044226329800704 123 263
1164 666029285002620928 41 118

1165 rows × 3 columns

2.2.5 Assessing Image Predictions file

Test:

In [71]:
df_pred.head(5)
Out[71]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
0 666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg 1 Welsh_springer_spaniel 0.465074 True collie 0.156665 True Shetland_sheepdog 0.061428 True
1 666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg 1 redbone 0.506826 True miniature_pinscher 0.074192 True Rhodesian_ridgeback 0.072010 True
2 666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg 1 German_shepherd 0.596461 True malinois 0.138584 True bloodhound 0.116197 True
3 666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg 1 Rhodesian_ridgeback 0.408143 True redbone 0.360687 True miniature_pinscher 0.222752 True
4 666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg 1 miniature_pinscher 0.560311 True Rottweiler 0.243682 True Doberman 0.154629 True

Recall that this dataset contains a tweet_id and the results of an image classifier that predicts dog breeds.

Our goal in this step is to get a table with a unique mapping between tweet_id and a dog breed. The neural network can predict categories other than dog breeds, which is why there is a column indicating whether each predicted category is a dog.

The model gives three outputs, each with its prediction confidence. Mapping a tweet_id to a dog breed is not as trivial as taking the most confident predicted category, because that category is not necessarily a dog.

  1. Predictions that are not dog breeds can happen for two different reasons:

    • The image was not framing a dog.

      Example: tweet_id 666051853826850816 has box_turtle as its most confident prediction, which is accurate and not a dog.

    • The image was framing a dog, but the neural network was not able to predict the breed of the dog and focused on something else.

      Example: tweet_id 666268910803644416 has the predictions desktop_computer, desk and bookcase (none of which are dog breeds).

  2. Predicting a dog breed correctly, but not as the first choice. Example: tweet_id 666057090499244032 has the predictions shopping_cart with 96.2% confidence, shopping_basket with 1.5% confidence and finally golden_retriever with 0.8% confidence.

In [72]:
df_pred[df_pred.p1_dog == False].head(5)
Out[72]:
tweet_id jpg_url img_num p1 p1_conf p1_dog p2 p2_conf p2_dog p3 p3_conf p3_dog
6 666051853826850816 https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg 1 box_turtle 0.933012 False mud_turtle 0.045885 False terrapin 0.017885 False
8 666057090499244032 https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg 1 shopping_cart 0.962465 False shopping_basket 0.014594 False golden_retriever 0.007959 True
17 666104133288665088 https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg 1 hen 0.965932 False cock 0.033919 False partridge 0.000052 False
18 666268910803644416 https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg 1 desktop_computer 0.086502 False desk 0.085547 False bookcase 0.079480 False
21 666293911632134144 https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg 1 three-toed_sloth 0.914671 False otter 0.015250 False great_grey_owl 0.013207 False
  • In conclusion, the data tidiness issue is that one conclusion (the most confident predicted dog breed) is stored across multiple columns; we need to create a single column containing the logical conclusion of those columns.

Note also that we are not using the column img_num at all: some tweets have more than one image but the predictive model has been run on only one of these, so this information is not valuable for further analysis.

Test:

In [73]:
set(df_pred.img_num)
Out[73]:
{1, 2, 3, 4}
In [74]:
len(set(df_pred.tweet_id))==df_pred.shape[0]  # check that the number of rows equals the number of unique ids
Out[74]:
True

2.2.6 Cleaning Image Predictions file

The goal is to obtain the dog breed information for each tweet_id; this information is stored across multiple columns.

The implementation works as follows: if the most confident category is a dog breed, take the first predicted category; else if the second category is a dog breed, take the second; else if the third is a dog breed, take the third; and if none of them is a dog breed, report the predicted dog breed as Unknown:

Code:

In [75]:
def get_dog_predicted_category(row):
    """Return the most confident prediction that is a dog breed, else 'Unknown'."""
    if row["p1_dog"]:
        return row["p1"]
    elif row["p2_dog"]:
        return row["p2"]
    elif row["p3_dog"]:
        return row["p3"]
    else:
        return "Unknown"

df_pred.apply(get_dog_predicted_category, axis = 1)
Out[75]:
0       Welsh_springer_spaniel
1                      redbone
2              German_shepherd
3          Rhodesian_ridgeback
4           miniature_pinscher
                 ...          
2070                    basset
2071        Labrador_retriever
2072                 Chihuahua
2073                 Chihuahua
2074                   Unknown
Length: 2075, dtype: object
In [76]:
df_pred_clean = pd.DataFrame()
df_pred_clean["tweet_id"] = df_pred["tweet_id"]
df_pred_clean["jpg_url"] = df_pred["jpg_url"]
df_pred_clean["dog_breed"] = df_pred.apply(get_dog_predicted_category, axis = 1)

df_pred_clean.set_index("tweet_id", inplace = True)
df_pred_clean
Out[76]:
jpg_url dog_breed
tweet_id
666020888022790149 https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg Welsh_springer_spaniel
666029285002620928 https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg redbone
666033412701032449 https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg German_shepherd
666044226329800704 https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg Rhodesian_ridgeback
666049248165822465 https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg miniature_pinscher
... ... ...
891327558926688256 https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg basset
891689557279858688 https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg Labrador_retriever
891815181378084864 https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg Chihuahua
892177421306343426 https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg Chihuahua
892420643555336193 https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg Unknown

2075 rows × 2 columns
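
The row-wise apply used above is easy to read but calls a Python function once per row. As an alternative, the same first-match logic can be sketched in vectorized form with numpy.select. The three-row frame below is hypothetical and only reuses the column names of df_pred (p1, p1_dog, and so on):

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample with the same column names as df_pred.
df_pred_demo = pd.DataFrame({
    "p1": ["redbone", "hen", "desktop_computer"],
    "p1_dog": [True, False, False],
    "p2": ["beagle", "cock", "desk"],
    "p2_dog": [True, False, False],
    "p3": ["basset", "Chihuahua", "bookcase"],
    "p3_dog": [True, True, False],
})

# Conditions are evaluated in order and the first one that holds wins,
# which mirrors the if / elif chain of get_dog_predicted_category.
conditions = [df_pred_demo[f"p{i}_dog"].to_numpy() for i in (1, 2, 3)]
choices = [df_pred_demo[f"p{i}"].to_numpy() for i in (1, 2, 3)]

dog_breed = pd.Series(
    np.select(conditions, choices, default="Unknown"),
    index=df_pred_demo.index,
)
print(dog_breed.tolist())
```

For a frame of about two thousand rows either approach is fast enough; the vectorized form mainly pays off on much larger data.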

Before merging the dog_breed conclusion for each tweet_id with the rest of the data from the other sources, it is interesting to analyze how many tweets we have not been able to classify as a dog breed.

Test:

In [77]:
df_pred_clean.groupby(['dog_breed']).count().rename(columns={"jpg_url": "counts"}).sort_values(by = "counts", ascending = False)
Out[77]:
counts
dog_breed
Unknown 324
golden_retriever 173
Labrador_retriever 113
Pembroke 96
Chihuahua 95
... ...
EntleBucher 1
Scotch_terrier 1
standard_schnauzer 1
Bouvier_des_Flandres 1
clumber 1

114 rows × 1 columns

In [78]:
df_gbreeds = df_pred_clean.groupby(['dog_breed']).count().rename(columns={"jpg_url": "counts"}).sort_values(by = "counts", ascending = False)
df_gbreeds["count_percentage"] = 100* df_gbreeds.counts / sum(df_gbreeds.counts)
df_gbreeds
Out[78]:
counts count_percentage
dog_breed
Unknown 324 15.614458
golden_retriever 173 8.337349
Labrador_retriever 113 5.445783
Pembroke 96 4.626506
Chihuahua 95 4.578313
... ... ...
EntleBucher 1 0.048193
Scotch_terrier 1 0.048193
standard_schnauzer 1 0.048193
Bouvier_des_Flandres 1 0.048193
clumber 1 0.048193

114 rows × 2 columns
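
The counts and percentages above can also be obtained with value_counts. This is a minimal sketch on a hypothetical four-element Series that stands in for df_pred_clean["dog_breed"]:

```python
import pandas as pd

# Hypothetical breed labels standing in for df_pred_clean["dog_breed"].
breeds = pd.Series(
    ["Unknown", "golden_retriever", "golden_retriever", "Chihuahua"],
    name="dog_breed",
)

# Absolute counts, and shares expressed as percentages of all rows.
counts = breeds.value_counts()
percentages = breeds.value_counts(normalize=True) * 100

# The two Series share the same index, so they align label by label.
summary = pd.DataFrame({"counts": counts, "count_percentage": percentages})
print(summary)
```

With normalize=True the division by the total number of rows is done by pandas, so no explicit sum is needed.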

Only about 15.6% of the predictions cannot be classified as a dog breed, i.e. about 84.4% of the predictions have been successfully mapped to a dog breed.

This share is small enough that these entries can be discarded and left out of further analysis.

Unfortunately, we cannot easily assess whether each prediction is accurate, and we do not have the validation/test accuracies of the predictive model, so for simplicity we will assume from now on that the most confident dog_breed is the ground-truth class.

Finally we are done cleaning the predictions file:

Code:

In [79]:
df_pred_fullclean = df_pred_clean.copy()

2.2.7 Merging

Next we will merge the following three dataframes into a single one:

- df_ta_fullclean
- df_api_fullclean
- df_pred_fullclean

Test:

In [80]:
df_ta_fullclean.head(5).T
Out[80]:
0 1 2 3 4
tweet_id 892420643555336193 892177421306343426 891815181378084864 891689557279858688 891327558926688256
timestamp 2017-08-01 16:23:56 +0000 2017-08-01 00:17:27 +0000 2017-07-31 00:18:03 +0000 2017-07-30 15:58:51 +0000 2017-07-29 16:00:24 +0000
source iphone iphone iphone iphone iphone
text This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f
expanded_urls https://twitter.com/dog_rates/status/892420643555336193/photo/1 https://twitter.com/dog_rates/status/892177421306343426/photo/1 https://twitter.com/dog_rates/status/891815181378084864/photo/1 https://twitter.com/dog_rates/status/891689557279858688/photo/1 https://twitter.com/dog_rates/status/891327558926688256/photo/1
name Phineas Tilly Archie Darla Franklin
dog_stage Unknown Unknown Unknown Unknown Unknown
rating 130 130 120 130 120
names Phineas Tilly Archie Darla Franklin
date 2017-08-01 2017-08-01 2017-07-31 2017-07-30 2017-07-29
time 16:23:56 00:17:27 00:18:03 15:58:51 16:00:24
hour 16 0 0 15 16
day 1 1 31 30 29
month 8 8 7 7 7
year 2017 2017 2017 2017 2017
calmonth 8-2017 8-2017 7-2017 7-2017 7-2017
day_of_week Tuesday Tuesday Monday Sunday Saturday
In [81]:
df_api_fullclean.head(5).T
Out[81]:
0 1 2 3 4
id 892177421306343426 891689557279858688 891087950875897856 890729181411237888 890240255349198849
retweet_count 5531 7618 2752 16647 6454
favorite_count 30525 38575 18580 59463 29174
In [82]:
df_pred_fullclean.head(5).T
Out[82]:
tweet_id 666020888022790149 666029285002620928 666033412701032449 666044226329800704 666049248165822465
jpg_url https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg
dog_breed Welsh_springer_spaniel redbone German_shepherd Rhodesian_ridgeback miniature_pinscher

Code:

In [83]:
df_merge1 = df_ta_fullclean.merge(df_api_fullclean,how = "left", left_on = "tweet_id", right_on = "id" ).drop(columns = "id")
In [84]:
df_fullmerge = df_merge1.merge(df_pred_fullclean, how = "left", left_on = "tweet_id", right_index = True)
In [85]:
df_fullmerge.T
Out[85]:
0 1 2 3 4 5 6 7 8 9 ... 2346 2347 2348 2349 2350 2351 2352 2353 2354 2355
tweet_id 892420643555336193 892177421306343426 891815181378084864 891689557279858688 891327558926688256 891087950875897856 890971913173991426 890729181411237888 890609185150312448 890240255349198849 ... 666058600524156928 666057090499244032 666055525042405380 666051853826850816 666050758794694657 666049248165822465 666044226329800704 666033412701032449 666029285002620928 666020888022790149
timestamp 2017-08-01 16:23:56 +0000 2017-08-01 00:17:27 +0000 2017-07-31 00:18:03 +0000 2017-07-30 15:58:51 +0000 2017-07-29 16:00:24 +0000 2017-07-29 00:08:17 +0000 2017-07-28 16:27:12 +0000 2017-07-28 00:22:40 +0000 2017-07-27 16:25:51 +0000 2017-07-26 15:59:51 +0000 ... 2015-11-16 01:01:59 +0000 2015-11-16 00:55:59 +0000 2015-11-16 00:49:46 +0000 2015-11-16 00:35:11 +0000 2015-11-16 00:30:50 +0000 2015-11-16 00:24:50 +0000 2015-11-16 00:04:52 +0000 2015-11-15 23:21:54 +0000 2015-11-15 23:05:30 +0000 2015-11-15 22:32:08 +0000
source iphone iphone iphone iphone iphone iphone iphone iphone iphone iphone ... iphone iphone iphone iphone iphone iphone iphone iphone iphone iphone
text This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ This is Franklin. He would like you to stop calling him "cute." He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f Here we have a majestic great white breaching off South Africa's coast. Absolutely h*ckin breathtaking. 13/10 (IG: tucker_marlo) #BarkWeek https://t.co/kQ04fDDRmh Meet Jax. He enjoys ice cream so much he gets nervous around it. 13/10 help Jax enjoy more things by clicking below\r\n\r\nhttps://t.co/Zr4hWfAs1H https://t.co/tVJBRMnhxl When you watch your owner call another dog a good boy but then they turn back to you and say you're a great boy. 13/10 https://t.co/v0nONBcwxq This is Zoey. She doesn't want to be one of the scary sharks. Just wants to be a snuggly pettable boatpet. 13/10 #BarkWeek https://t.co/9TwLuAGH0b This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A ... Here is the Rand Paul of retrievers folks! He's probably good at poker. Can drink beer (lol rad). 8/10 good dog https://t.co/pYAJkAe76p My oh my. This is a rare blond Canadian terrier on wheels. Only $8.98. Rather docile. 9/10 very rare https://t.co/yWBqbrzy8O Here is a Siberian heavily armored polar bear mix. Strong owner. 10/10 I would do unspeakable things to pet this dog https://t.co/rdivxLiqEt This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 
2/10 https://t.co/v5A4vzSDdc This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj
expanded_urls https://twitter.com/dog_rates/status/892420643555336193/photo/1 https://twitter.com/dog_rates/status/892177421306343426/photo/1 https://twitter.com/dog_rates/status/891815181378084864/photo/1 https://twitter.com/dog_rates/status/891689557279858688/photo/1 https://twitter.com/dog_rates/status/891327558926688256/photo/1 https://twitter.com/dog_rates/status/891087950875897856/photo/1 https://gofundme.com/ydvmve-surgery-for-jax https://twitter.com/dog_rates/status/890729181411237888/photo/1 https://twitter.com/dog_rates/status/890609185150312448/photo/1 https://twitter.com/dog_rates/status/890240255349198849/photo/1 ... https://twitter.com/dog_rates/status/666058600524156928/photo/1 https://twitter.com/dog_rates/status/666057090499244032/photo/1 https://twitter.com/dog_rates/status/666055525042405380/photo/1 https://twitter.com/dog_rates/status/666051853826850816/photo/1 https://twitter.com/dog_rates/status/666050758794694657/photo/1 https://twitter.com/dog_rates/status/666049248165822465/photo/1 https://twitter.com/dog_rates/status/666044226329800704/photo/1 https://twitter.com/dog_rates/status/666033412701032449/photo/1 https://twitter.com/dog_rates/status/666029285002620928/photo/1 https://twitter.com/dog_rates/status/666020888022790149/photo/1
name Phineas Tilly Archie Darla Franklin None Jax None Zoey Cassie ... the a a an a None a a a None
dog_stage Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown doggo ... Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown Unknown
rating 130 130 120 130 120 130 130 130 130 140 ... 80 90 100 20 100 50 60 90 70 80
names Phineas Tilly Archie Darla Franklin None Jax None Zoey Cassie ... Unknown Unknown Unknown Unknown Unknown None Unknown Unknown Unknown None
date 2017-08-01 2017-08-01 2017-07-31 2017-07-30 2017-07-29 2017-07-29 2017-07-28 2017-07-28 2017-07-27 2017-07-26 ... 2015-11-16 2015-11-16 2015-11-16 2015-11-16 2015-11-16 2015-11-16 2015-11-16 2015-11-15 2015-11-15 2015-11-15
time 16:23:56 00:17:27 00:18:03 15:58:51 16:00:24 00:08:17 16:27:12 00:22:40 16:25:51 15:59:51 ... 01:01:59 00:55:59 00:49:46 00:35:11 00:30:50 00:24:50 00:04:52 23:21:54 23:05:30 22:32:08
hour 16 0 0 15 16 0 16 0 16 15 ... 1 0 0 0 0 0 0 23 23 22
day 1 1 31 30 29 29 28 28 27 26 ... 16 16 16 16 16 16 16 15 15 15
month 8 8 7 7 7 7 7 7 7 7 ... 11 11 11 11 11 11 11 11 11 11
year 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ... 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015
calmonth 8-2017 8-2017 7-2017 7-2017 7-2017 7-2017 7-2017 7-2017 7-2017 7-2017 ... 11-2015 11-2015 11-2015 11-2015 11-2015 11-2015 11-2015 11-2015 11-2015 11-2015
day_of_week Tuesday Tuesday Monday Sunday Saturday Saturday Friday Friday Thursday Wednesday ... Monday Monday Monday Monday Monday Monday Monday Sunday Sunday Sunday
retweet_count NaN 5531 NaN 7618 NaN 2752 NaN 16647 NaN 6454 ... 51 NaN 213 NaN 51 NaN 123 NaN 41 NaN
favorite_count NaN 30525 NaN 38575 NaN 18580 NaN 59463 NaN 29174 ... 103 NaN 402 NaN 119 NaN 263 NaN 118 NaN
jpg_url https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg ... https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg
dog_breed Unknown Chihuahua Chihuahua Labrador_retriever basset Chesapeake_Bay_retriever Appenzeller Pomeranian Irish_terrier Pembroke ... miniature_poodle golden_retriever chow Unknown Bernese_mountain_dog miniature_pinscher Rhodesian_ridgeback German_shepherd redbone Welsh_springer_spaniel

21 rows × 2356 columns
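
In the merged output, retweet_count and favorite_count are NaN for many rows, because a left merge keeps every archive tweet even when the API data has no match for it. A quick way to quantify this is the indicator option of merge. The sketch below uses two hypothetical miniature frames (df_archive_demo and df_api_demo are assumptions, not the real dataframes):

```python
import pandas as pd

# Hypothetical miniatures of the archive and API frames.
df_archive_demo = pd.DataFrame({"tweet_id": [1, 2, 3, 4], "rating": [130, 120, 130, 140]})
df_api_demo = pd.DataFrame({"id": [2, 4], "retweet_count": [5531, 7618]})

merged = df_archive_demo.merge(
    df_api_demo,
    how="left",
    left_on="tweet_id",
    right_on="id",
    validate="one_to_one",  # raise if either key column contains duplicates
    indicator=True,         # add a _merge column: both / left_only / right_only
).drop(columns="id")

# Share of archive rows that found a match in the API data.
match_rate = (merged["_merge"] == "both").mean()
print(merged)
print(f"matched: {match_rate:.0%}")
```

A low match rate would suggest looking again at how the API data was gathered before drawing conclusions from the retweet and favorite counts.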

2.3 Storing

Code:

In [86]:
df_fullmerge.to_csv("data/twitter_archive_master.csv", index=False)
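
CSV files do not store column types, so it can be worth checking what comes back after a round trip. The sketch below writes a hypothetical two-row frame to an in-memory buffer (instead of data/twitter_archive_master.csv) and reads it back with explicit types; reading tweet_id as text is one possible choice here, not something the original notebook does:

```python
import io
import pandas as pd

# Hypothetical two-row frame with the same kinds of columns as the master file.
df_demo = pd.DataFrame({
    "tweet_id": [892420643555336193, 666020888022790149],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2015-11-15 22:32:08 +0000"],
    "retweet_count": [None, 41.0],
})

# Write to an in-memory buffer rather than to a file on disk.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

# Treat tweet_id as an identifier (text) and parse the timestamps as dates.
df_back = pd.read_csv(buffer, dtype={"tweet_id": str}, parse_dates=["timestamp"])
print(df_back.dtypes)
```

Missing retweet counts are written as empty fields and come back as NaN.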

3. Data Visualization

Code:

In [87]:
# read the stored master dataset back from disk
df_fullmerge = pd.read_csv("data/twitter_archive_master.csv")
In [88]:
df_fullmerge.T
Out[88]:

21 rows × 2356 columns

3.1 What is the relationship between retweets count and favorites count?

They appear to be positively and roughly linearly correlated, i.e. tweets with many retweets also tend to have many favorites.
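
The visual impression can be backed with a correlation coefficient. The sketch below uses only the five value pairs visible in the df_api_fullclean preview plus one missing row, so its numbers are illustrative and say nothing about the full dataset:

```python
import pandas as pd

# The five pairs from the df_api_fullclean preview, plus one missing row.
df_demo = pd.DataFrame({
    "retweet_count": [5531, 7618, 2752, 16647, 6454, None],
    "favorite_count": [30525, 38575, 18580, 59463, 29174, None],
})

# Pearson correlation measures linear association; NaN pairs are ignored.
pearson = df_demo["retweet_count"].corr(df_demo["favorite_count"])

# Pearson correlation of the ranks is the Spearman rank correlation,
# which is less sensitive to a few very viral tweets.
rank_corr = df_demo["retweet_count"].rank().corr(df_demo["favorite_count"].rank())

print(f"Pearson r = {pearson:.3f}, rank correlation = {rank_corr:.3f}")
```

On the real data the same two lines could be applied directly to df_fullmerge.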

In [89]:
!pip install imgkit
Requirement already satisfied: imgkit in c:\users\xavie\anaconda3\envs\pfm\lib\site-packages (1.0.2)
In [90]:
import altair_saver
source = df_fullmerge

chart = alt.Chart(source).mark_circle(size=100).encode(
    x='retweet_count',
    y='favorite_count',
    color='dog_stage',
    tooltip=['tweet_id','dog_stage','retweet_count','favorite_count']
).interactive().properties(
    width=650,
    height=400
)

chart.save('img/scatter.html')
chart
Out[90]:

3.2 At which days and hours are tweets published?

December 2015 was the month with the most tweets published, mainly between 1:00 and 5:00.
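
The numbers behind such a heatmap can be tabulated directly in pandas. This is a minimal sketch with four hypothetical timestamps in the same format as the archive's timestamp column:

```python
import pandas as pd

# Hypothetical timestamps in the same format as the archive's timestamp column.
df_demo = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": [
        "2015-12-01 01:10:00 +0000",
        "2015-12-02 01:45:00 +0000",
        "2015-12-03 03:00:00 +0000",
        "2017-01-05 20:30:00 +0000",
    ],
})

ts = pd.to_datetime(df_demo["timestamp"], utc=True)

# Rows: calendar month; columns: hour of day; values: number of tweets.
counts = pd.crosstab(
    ts.dt.strftime("%Y-%m").rename("month"),
    ts.dt.hour.rename("hour"),
)
print(counts)
```

This sketch works in UTC, while a chart rendered in the browser may show hours in local time, so the two can be shifted against each other.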

In [91]:
heat_count = alt.Chart(df_fullmerge).mark_rect().encode(
    alt.X('hours(timestamp):O', title='hour of day'),
    alt.Y('yearmonth(timestamp):O', title='date'),
    alt.Color('count(tweet_id):Q', title='Count of tweets'),
    alt.Tooltip(['yearmonth(timestamp)','hours(timestamp)','count(tweet_id):Q'])
)

heat_count.save('img/calendar_counts.html')


heat_count
Out[91]:

3.3 On which days and at which hours were the most retweeted tweets published?

It is interesting to note that the accumulated number of retweets is not linked to the number of tweets: the tweets with the most virality were published in Jan 2017 at 3:00 and Jan 2016 at 20:00.

In [92]:
heat_retweet = alt.Chart(df_fullmerge).mark_rect().encode(
    alt.X('hours(timestamp):O', title='hour of day'),
    alt.Y('yearmonth(timestamp):O', title='date'),
    alt.Color('sum(retweet_count):Q', title='Count of content retweets'),
    alt.Tooltip(['yearmonth(timestamp)','hours(timestamp)','sum(retweet_count)'])
)

heat_retweet.save('img/calendar_retweet.html')


heat_retweet
Out[92]:

3.4 Which are the most retweeted dog breeds over time?

Virality does not seem to be related to the dog breed: for any breed we can find outlier tweets that went viral. For example, Labrador retriever (Jun 2016, 75153 retweets) outperformed by far all the other months for the same dog_breed.

Furthermore, there are breeds such as Eskimo dog (Jun 2016, 55956 retweets) or standard poodle (Jan 2017, 36372 retweets) that are rarely published but perform really well when they do go viral.
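
One way to look for such outliers numerically is to compare each breed's best tweet with its typical tweet. In the sketch below the three large retweet counts are the ones quoted above, while the second Labrador row (2100) and the exact breed spellings are hypothetical:

```python
import pandas as pd

# Three counts quoted in the text plus one hypothetical ordinary tweet (2100).
df_demo = pd.DataFrame({
    "dog_breed": ["Labrador_retriever", "Labrador_retriever", "Eskimo_dog", "standard_poodle"],
    "retweet_count": [75153, 2100, 55956, 36372],
})

# Compare each breed's single best tweet with its median tweet.
per_breed = df_demo.groupby("dog_breed")["retweet_count"].agg(["count", "median", "max"])
per_breed["max_over_median"] = per_breed["max"] / per_breed["median"]

print(per_breed.sort_values("max", ascending=False))
```

A large max_over_median combined with a reasonable count points to a breed whose total is dominated by a single viral tweet.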

In [93]:
import altair as alt

source = df_fullmerge

# bubble chart: one circle per tweet, sized by its retweet count
chart = alt.Chart(source).mark_circle(
    opacity=0.8,
    stroke='black',
    strokeWidth=1
).encode(
    alt.X('yearmonth(timestamp):O', axis=alt.Axis(labelAngle=0)),
    alt.Y('dog_breed:N'),
    alt.Size('retweet_count:Q',
        scale=alt.Scale(range=[0, 4000]),
        legend=alt.Legend(title='Retweet count')
    ),
    alt.Tooltip(['dog_breed','yearmonth(timestamp)','names','tweet_id:N','retweet_count']),
    alt.Color('dog_breed:N', legend=None)
).properties(
    width=500,
    height=1000
)

chart.save('img/bubles.html')

chart
Out[93]:

3.5 What is the distribution of Retweets?

Most tweets do not seem to go viral at all; the tweets that do go viral are few, but they end up really far from the bunch.
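
A long right tail like this also shows up when the mean is compared with the median and the upper quantiles. The sketch below uses a handful of retweet counts that appear in the outputs earlier in this notebook, so it is illustrative only:

```python
import pandas as pd

# A few retweet counts seen earlier in this notebook, plus one missing value.
retweets = pd.Series([41, 51, 123, 213, 2752, 5531, 6454, 7618, 16647, 75153, None])

# quantile(), mean() and median() all skip NaN by default.
summary = retweets.quantile([0.5, 0.9, 0.99])
mean_value = retweets.mean()
median_value = retweets.median()

# A mean well above the median signals a long right tail.
print(summary)
print("mean:", mean_value, "median:", median_value)
```

For data this skewed, a histogram of the logarithm of the counts is another common way to make the bulk of the distribution visible.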

In [94]:
#df_fullmerge[["retweet_count","favorite_count"]].hist(bins = 50, figsize = [15,4])
In [95]:
# drop tweets without a retweet count (NaN) so the histogram bins are well defined
x = df_fullmerge["retweet_count"].dropna()

num_bins = 50

fig, ax = plt.subplots()
fig.set_size_inches(12,5)

# the histogram of the data
n, bins, patches = ax.hist(x, num_bins, density=False)

ax.set_xlabel('Retweet Count')
ax.set_ylabel('Count Tweets')
ax.set_title('Histogram of Retweet Count')

plt.savefig('img/hist_retweets.png')
plt.show()

3.6 Which are the dog breeds with the most accumulated retweets?

  • Golden Retriever (173 tweets)
  • Labrador Retriever (113 tweets)
  • Chihuahua (95 tweets)

These are the top three identified dog breeds in our dataset.
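
To rank breeds by accumulated retweets in pandas rather than reading it off a chart, a groupby with named aggregations is enough. The frame below is a hypothetical stand-in for df_fullmerge with only the two columns that matter here:

```python
import pandas as pd

# Hypothetical rows; the real input would be df_fullmerge.
df_demo = pd.DataFrame({
    "dog_breed": ["golden_retriever", "golden_retriever", "Chihuahua", "Unknown", "Chihuahua"],
    "retweet_count": [7618.0, 2752.0, 5531.0, 900.0, None],
})

ranking = (
    df_demo[df_demo["dog_breed"] != "Unknown"]  # leave unclassified tweets out
    .groupby("dog_breed")
    .agg(
        total_retweets=("retweet_count", "sum"),  # NaN counts are skipped
        tweets=("retweet_count", "size"),         # size counts every row
    )
    .sort_values("total_retweets", ascending=False)
)
print(ranking)
```

Keeping the tweet count next to the total helps to tell breeds that are simply posted often from breeds that get many retweets per tweet.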

In [96]:
import altair as alt

source = df_fullmerge

chart = alt.Chart(source).mark_bar().encode(
    y='sum(retweet_count):Q',
    tooltip = ['dog_breed','sum(retweet_count)', 'count(tweet_id)'],
    x=alt.X('dog_breed:N', sort='-y')
)


chart.save('img/top_dog_breeds.html')
chart
Out[96]: